Preface

This volume contains the papers selected for presentation at the Seventh
International Workshop on Rough Sets, Fuzzy Sets, Data Mining, and Granular-Soft
Computing (RSFDGrC’99), held at the Yamaguchi Resort Center, Ube,
Yamaguchi, Japan, November 9-11, 1999. The workshop was organized by the
International Rough Set Society, the BISC Special Interest Group on Granular
Computing (GrC), the Polish-Japanese Institute of Information Technology, and
Yamaguchi University.

RSFDGrC’99 provided an international forum for sharing original research
results and practical development experiences among experts in these emerging
fields. An important feature of the workshop was its emphasis on the integration
of intelligent information techniques, that is, on promoting a deep fusion
of these approaches across the AI, soft computing, and database communities in
order to solve real-world, large, complex problems concerned with uncertainty
and fuzziness. In particular, rough and fuzzy set methods in data mining and
granular computing were on display.

A total of 89 papers from 21 countries, touching a wide spectrum of topics
related to both theory and applications, were submitted to RSFDGrC’99. Of
these, 45 papers were selected for regular presentations and 15 for short
presentations. Seven technical sessions were organized, namely:
Rough Set Theory and Its Applications; Fuzzy Set Theory and Its Applications;
Non-classical Logic and Approximate Reasoning; Information Granulation and
Granular Computing; Data Mining and Knowledge Discovery; Machine Learning;
Intelligent Agents and Systems.

The RSFDGrC’99 program was enriched by four invited speakers: Zdzislaw
Pawlak, Lotfi A. Zadeh, Philip Yu, and Setsuo Arikawa, from the soft computing,
database, and AI communities. A special session on Rough Computing:
Foundations and Applications was organized by James F. Peters.

An  event  like  this  can  only  succeed  as  a  team  effort.  We  would  like  to 
acknowledge  the  contribution  of  the  program  committee  members  and  thank 
the  reviewers  for  their  efforts.  Many  thanks  to  the  honorary  chairs  Zdzislaw 
Pawlak  and  Lotfi  A.  Zadeh  as  well  as  the  general  chairs  Setsuo  Ohsuga  and  T.Y. 
Lin.  Their  involvement  and  support  have  added  greatly  to  the  quality  of  the 
workshop.  Our  sincere  gratitude  goes  to  all  of  the  authors  who  submitted  papers. 
We  are  grateful  to  our  sponsors:  Kayamori  Foundation  of  Informational  Science 
Advancement,  United  States  Air  Force  Asian  Office  of  Aerospace  Research  and 
Development,  and  Yamaguchi  Industrial  Technology  Development  Organizer, 
for their generous support. We wish to express our thanks to Alfred Hofmann
of Springer-Verlag for his help and cooperation.

Abstract. This paper concerns a relationship between Bayes’ inference
rule and decision rules from the rough set perspective.

In statistical inference based on Bayes’ rule it is assumed that some
prior knowledge (prior probability) about some parameters, without knowledge
about the data, is given first. Next the posterior probability is computed
by employing the available data. The posterior probability is then
used to verify the prior probability.

In the rough set philosophy two conditional probabilities, called certainty
and coverage factors, are associated with every decision rule. These two
factors are closely related to the lower and the upper approximation
of a set, basic notions of rough set theory. Besides, it is revealed that
these two factors satisfy Bayes’ rule. That means that we can apply
Bayes’ rule of inference to data analysis without referring to the Bayesian
philosophy of prior and posterior probabilities.

Key  words:  Bayes’  rule,  rough  sets,  decision  rules,  information  system 


1  Introduction 

This  paper  is  an  extended  version  of  the  author’s  ideas  presented  in  [5, 6, 7, 8]. 
It  concerns  some  relationships  between  probability,  logic  and  rough  sets  and  it 
refers  to  some  concepts  of  Lukasiewicz  presented  in  [3]. 

We  will  dwell  in  this  paper  upon  the  Bayesian  philosophy  of  data  analysis 
and  that  proposed  by  rough  set  theory. 

Statistical  inference  grounded  on  the  Bayes’  rule  supposes  that  some  prior 
knowledge  (prior  probability)  about  some  parameters  without  knowledge  about 
the  data  is  given  first.  Next  the  posterior  probability  is  computed  when  the 
data  are  available.  The  posterior  probability  is  then  used  to  verify  the  prior 
probability. 

In the rough set philosophy two conditional probabilities, called certainty
and coverage factors, are associated with every decision rule. These two factors
are closely related to the lower and the upper approximation of a set, basic
concepts of rough set theory. Besides, it turns out that these two factors satisfy
Bayes’ rule. That means that we can apply Bayes’ rule of inference to data
analysis without referring to Bayesian philosophy, i.e., to the prior and
posterior probabilities. In other words, every data set with distinguished condition
and decision attributes satisfies Bayes’ rule. This property offers a new
perspective on methods of reasoning about data.

2  Information  System  and  Decision  Table 

The starting point of rough set based data analysis is a data set, called an
information system.

An information system is a data table whose columns are labelled by
attributes, whose rows are labelled by objects of interest and whose entries are
attribute values.

Formally, by an information system we will understand a pair $S = (U, A)$,
where $U$ and $A$ are finite, nonempty sets called the universe and the set of
attributes, respectively. With every attribute $a \in A$ we associate a set $V_a$ of its
values, called the domain of $a$. Any subset $B$ of $A$ determines a binary relation
$I(B)$ on $U$, which will be called an indiscernibility relation, and is defined as
follows: $(x, y) \in I(B)$ if and only if $a(x) = a(y)$ for every $a \in B$, where $a(x)$
denotes the value of attribute $a$ for element $x$. Obviously $I(B)$ is an equivalence
relation. The family of all equivalence classes of $I(B)$, i.e., the partition determined
by $B$, will be denoted by $U/I(B)$, or simply $U/B$; an equivalence class of $I(B)$,
i.e., the block of the partition $U/B$ containing $x$, will be denoted by $B(x)$.

If $(x, y)$ belongs to $I(B)$ we will say that $x$ and $y$ are B-indiscernible or
indiscernible with respect to $B$. Equivalence classes of the relation $I(B)$ (or
blocks of the partition $U/B$) are referred to as B-elementary sets or B-granules.

If  we  distinguish  in  an  information  system  two  classes  of  attributes,  called 
condition  and  decision  attributes,  respectively,  then  the  system  will  be  called  a 
decision  table. 

A  simple,  tutorial  example  of  an  information  system  (a  decision  table)  is 
shown  in  Table  1. 


Table 1. An example of a decision table

Car  F     P     S      M
1    med.  med.  med.   poor
2    high  med.  large  poor
3    med.  low   large  poor
4    low   med.  med.   good
5    high  low   small  poor
6    med.  low   large  good

The table contains data about six cars, where F, P, S and M denote fuel
consumption, selling price, size and marketability, respectively.
Attributes F, P and S are condition attributes, whereas M is the decision
attribute. Each row of the decision table determines a decision obeyed when
specified conditions are satisfied.

3  Approximations 

Suppose we are given an information system (a data set) $S = (U, A)$, a subset
$X$ of the universe $U$, and a subset of attributes $B$. Our task is to describe the set
$X$ in terms of attribute values from $B$. To this end we define two operations
assigning to every $X \subseteq U$ two sets $B_*(X)$ and $B^*(X)$, called the B-lower and
the B-upper approximation of $X$, respectively, and defined as follows:

$$B_*(X) = \bigcup_{x \in U} \{B(x) : B(x) \subseteq X\},$$

$$B^*(X) = \bigcup_{x \in U} \{B(x) : B(x) \cap X \neq \emptyset\}.$$

Hence, the B-lower approximation of a set is the union of all B-granules that are
included in the set, whereas the B-upper approximation of a set is the union of
all B-granules that have a nonempty intersection with the set. The set

$$BN_B(X) = B^*(X) - B_*(X)$$

will be referred to as the B-boundary region of $X$.

If the boundary region of $X$ is the empty set, i.e., $BN_B(X) = \emptyset$, then $X$ is
crisp (exact) with respect to $B$; in the opposite case, i.e., if $BN_B(X) \neq \emptyset$, $X$ is
referred to as rough (inexact) with respect to $B$.

For example, let $C = \{F, P, S\}$ be the set of all condition attributes. Then for
the set $X = \{1, 2, 3, 5\}$ of cars with poor marketability we have $C_*(X) = \{1, 2, 5\}$,
$C^*(X) = \{1, 2, 3, 5, 6\}$ and $BN_C(X) = \{3, 6\}$.
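The example above can be checked mechanically. The following Python sketch encodes Table 1 as a dictionary and computes the C-lower approximation, the C-upper approximation and the boundary region of the set of cars with poor marketability. The names `CARS`, `granule`, `lower_approx` and `upper_approx` are illustrative choices, not notation from the paper.

```python
# Table 1: car id -> attribute values (data copied from the table).
CARS = {
    1: {"F": "med.", "P": "med.", "S": "med.", "M": "poor"},
    2: {"F": "high", "P": "med.", "S": "large", "M": "poor"},
    3: {"F": "med.", "P": "low", "S": "large", "M": "poor"},
    4: {"F": "low", "P": "med.", "S": "med.", "M": "good"},
    5: {"F": "high", "P": "low", "S": "small", "M": "poor"},
    6: {"F": "med.", "P": "low", "S": "large", "M": "good"},
}


def granule(table, x, attrs):
    """B(x): all objects indiscernible from x with respect to attrs."""
    key = tuple(table[x][a] for a in attrs)
    return frozenset(
        y for y in table if tuple(table[y][a] for a in attrs) == key
    )


def lower_approx(table, X, attrs):
    """Union of the granules that are included in X."""
    return {x for x in table if granule(table, x, attrs) <= X}


def upper_approx(table, X, attrs):
    """Union of the granules that intersect X."""
    return {x for x in table if granule(table, x, attrs) & X}


C = ("F", "P", "S")
X = {x for x, row in CARS.items() if row["M"] == "poor"}

low = lower_approx(CARS, X, C)
up = upper_approx(CARS, X, C)
boundary = up - low

print(sorted(low), sorted(up), sorted(boundary))
# prints: [1, 2, 5] [1, 2, 3, 5, 6] [3, 6]
```

Cars 3 and 6 share the same condition values but differ in marketability, which is why they form the boundary region.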

4  Decision  Rules 

With every information system $S = (U, A)$ we associate a formal language $L(S)$,
written $L$ when $S$ is understood. Expressions of the language $L$ are logical
formulas denoted by $\Phi$, $\Psi$, etc., built up from attributes and attribute-value pairs by
means of the logical connectives $\wedge$ (and), $\vee$ (or), $\sim$ (not) in the standard way. We
will denote by $\|\Phi\|_S$ the set of all objects $x \in U$ satisfying $\Phi$ in $S$ and refer to
it as the meaning of $\Phi$ in $S$.

The meaning of $\Phi$ in $S$ is defined inductively as follows:

1) $\|(a, v)\|_S = \{x \in U : a(x) = v\}$ for all $a \in A$ and $v \in V_a$,

2) $\|\Phi \vee \Psi\|_S = \|\Phi\|_S \cup \|\Psi\|_S$,

3) $\|\Phi \wedge \Psi\|_S = \|\Phi\|_S \cap \|\Psi\|_S$,

4) $\|{\sim}\Phi\|_S = U - \|\Phi\|_S$.


A formula $\Phi$ is true in $S$ if $\|\Phi\|_S = U$.

A decision rule in $L$ is an expression $\Phi \to \Psi$, read if $\Phi$ then $\Psi$; $\Phi$ and $\Psi$ are
referred to as conditions and decisions of the rule, respectively.

An example of a decision rule is given below:

$$(F, \text{med.}) \wedge (P, \text{low}) \wedge (S, \text{large}) \to (M, \text{poor}).$$

Obviously a decision rule $\Phi \to \Psi$ is true in $S$ if $\|\Phi\|_S \subseteq \|\Psi\|_S$.

With every decision rule $\Phi \to \Psi$ we associate a conditional probability
$\pi_S(\Psi|\Phi)$ that $\Psi$ is true in $S$ given $\Phi$ is true in $S$ with the probability
$\pi_S(\Phi) = card(\|\Phi\|_S) / card(U)$, called the certainty factor and defined as follows:

$$\pi_S(\Psi|\Phi) = \frac{card(\|\Phi \wedge \Psi\|_S)}{card(\|\Phi\|_S)},$$

where $\|\Phi\|_S \neq \emptyset$.

This coefficient is widely used in data mining, where it is called the
"confidence coefficient".

Obviously, $\pi_S(\Psi|\Phi) = 1$ if and only if $\Phi \to \Psi$ is true in $S$.

If $\pi_S(\Psi|\Phi) = 1$, then $\Phi \to \Psi$ will be called a certain decision rule; if
$0 < \pi_S(\Psi|\Phi) < 1$ the decision rule will be referred to as a possible decision rule.
Besides, we will also need a coverage factor

$$\pi_S(\Phi|\Psi) = \frac{card(\|\Phi \wedge \Psi\|_S)}{card(\|\Psi\|_S)},$$

which is the conditional probability that $\Phi$ is true in $S$, given $\Psi$ is true in $S$ with
the probability $\pi_S(\Psi)$.

Certainty  and  coverage  factors  for  decision  rules  associated  with  Table  1  are 
given  in  Table  2. 


Table 2. Certainty and coverage factors

Car  F     P     S      M     Cert.  Cov.
1    med.  med.  med.   poor  1      1/4
2    high  med.  large  poor  1      1/4
3    med.  low   large  poor  1/2    1/4
4    low   med.  med.   good  1      1/2
5    high  low   small  poor  1      1/4
6    med.  low   large  good  1/2    1/2

More  about  managing  uncertainty  in  decision  rules  can  be  found  in  [2]. 
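The certainty and coverage factors of Table 2 follow directly from the two cardinality ratios defined above. A minimal Python sketch, assuming each row of Table 1 is read as one decision rule whose condition is the full (F, P, S) value triple; the function name `factors` is hypothetical:

```python
from fractions import Fraction

# Table 1: car id -> attribute values (data copied from the table).
CARS = {
    1: {"F": "med.", "P": "med.", "S": "med.", "M": "poor"},
    2: {"F": "high", "P": "med.", "S": "large", "M": "poor"},
    3: {"F": "med.", "P": "low", "S": "large", "M": "poor"},
    4: {"F": "low", "P": "med.", "S": "med.", "M": "good"},
    5: {"F": "high", "P": "low", "S": "small", "M": "poor"},
    6: {"F": "med.", "P": "low", "S": "large", "M": "good"},
}


def factors(table, x, cond, dec):
    """Return (certainty, coverage) of the rule read off row x."""
    cond_key = tuple(table[x][a] for a in cond)
    dec_key = tuple(table[x][a] for a in dec)
    # Meaning of the condition part and of the decision part of the rule.
    phi = {y for y in table if tuple(table[y][a] for a in cond) == cond_key}
    psi = {y for y in table if tuple(table[y][a] for a in dec) == dec_key}
    both = phi & psi
    return Fraction(len(both), len(phi)), Fraction(len(both), len(psi))


TABLE2 = {x: factors(CARS, x, ("F", "P", "S"), ("M",)) for x in CARS}

for car, (cert, cov) in TABLE2.items():
    print(car, cert, cov)
```

Exact fractions are used instead of floats so that the output can be compared with the table entry by entry.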
5 Decision Rules and Approximations

Let $\{\Phi_i \to \Psi\}_n$ be a set of decision rules such that
all conditions $\Phi_i$ are pairwise mutually exclusive, i.e., $\|\Phi_i \wedge \Phi_j\|_S = \emptyset$ for any
$1 \le i, j \le n$, $i \neq j$, and

$$\sum_{i=1}^{n} \pi_S(\Phi_i|\Psi) = 1. \qquad (1)$$

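Let $C$ and $D$ be condition and decision attributes, respectively, and let
$\{\Phi_i \to \Psi\}_n$ be a set of decision rules satisfying (1).

Then the following relationships are valid:

a) $C_*(\|\Psi\|_S) = \left\| \bigvee_{\pi_S(\Psi|\Phi_i) = 1} \Phi_i \right\|_S$,

b) $C^*(\|\Psi\|_S) = \left\| \bigvee_{0 < \pi_S(\Psi|\Phi_i) \le 1} \Phi_i \right\|_S$,

c) $BN_C(\|\Psi\|_S) = \left\| \bigvee_{0 < \pi_S(\Psi|\Phi_i) < 1} \Phi_i \right\|_S = \bigcup_{i \,:\, 0 < \pi_S(\Psi|\Phi_i) < 1} \|\Phi_i\|_S$.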
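Relationships a) to c) can be illustrated on Table 1 by grouping the distinct condition triples according to their certainty factor for the decision (M, poor). This is a sketch under the assumption that the conditions are the distinct (F, P, S) value triples occurring in the table; the helper names are hypothetical.

```python
from fractions import Fraction

# Table 1: car id -> attribute values (data copied from the table).
CARS = {
    1: {"F": "med.", "P": "med.", "S": "med.", "M": "poor"},
    2: {"F": "high", "P": "med.", "S": "large", "M": "poor"},
    3: {"F": "med.", "P": "low", "S": "large", "M": "poor"},
    4: {"F": "low", "P": "med.", "S": "med.", "M": "good"},
    5: {"F": "high", "P": "low", "S": "small", "M": "poor"},
    6: {"F": "med.", "P": "low", "S": "large", "M": "good"},
}
C = ("F", "P", "S")


def meaning_of_condition(triple):
    """Objects whose (F, P, S) values equal the given triple."""
    return {x for x, r in CARS.items() if tuple(r[a] for a in C) == triple}


psi = {x for x, r in CARS.items() if r["M"] == "poor"}  # meaning of (M, poor)
conditions = sorted({tuple(r[a] for a in C) for r in CARS.values()})

# Certainty factor of each rule "condition -> (M, poor)".
cert = {}
for triple in conditions:
    phi = meaning_of_condition(triple)
    cert[triple] = Fraction(len(phi & psi), len(phi))


def union_where(test):
    """Union of the meanings of all conditions whose certainty passes test."""
    result = set()
    for triple in conditions:
        if test(cert[triple]):
            result |= meaning_of_condition(triple)
    return result


lower = union_where(lambda p: p == 1)        # relationship a)
upper = union_where(lambda p: 0 < p <= 1)    # relationship b)
boundary = union_where(lambda p: 0 < p < 1)  # relationship c)

print(sorted(lower), sorted(upper), sorted(boundary))
# prints: [1, 2, 5] [1, 2, 3, 5, 6] [3, 6]
```

The three sets agree with the approximations of the set of cars with poor marketability computed directly from the granules in Section 3.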

The above properties enable us to introduce the following definitions:

i) If $\|\Phi\|_S = C_*(\|\Psi\|_S)$, then the formula $\Phi$ will be called the C-lower
approximation of the formula $\Psi$ and will be denoted by $C_*(\Psi)$;

ii) If $\|\Phi\|_S = C^*(\|\Psi\|_S)$, then the formula $\Phi$ will be called the C-upper
approximation of the formula $\Psi$ and will be denoted by $C^*(\Psi)$;

iii) If $\|\Phi\|_S = BN_C(\|\Psi\|_S)$, then $\Phi$ will be called the C-boundary of the formula
$\Psi$ and will be denoted by $BN_C(\Psi)$.

Let us consider the following example.

The C-lower approximation of (M, poor) is the formula

$$C_*(M, \text{poor}) = ((F, \text{med.}) \wedge (P, \text{med.}) \wedge (S, \text{med.})) \vee ((F, \text{high}) \wedge (P, \text{med.}) \wedge (S, \text{large})) \vee ((F, \text{high}) \wedge (P, \text{low}) \wedge (S, \text{small})).$$

The C-upper approximation of (M, poor) is the formula

$$C^*(M, \text{poor}) = ((F, \text{med.}) \wedge (P, \text{med.}) \wedge (S, \text{med.})) \vee ((F, \text{high}) \wedge (P, \text{med.}) \wedge (S, \text{large})) \vee ((F, \text{med.}) \wedge (P, \text{low}) \wedge (S, \text{large})) \vee ((F, \text{high}) \wedge (P, \text{low}) \wedge (S, \text{small})).$$

The C-boundary of (M, poor) is the formula

$$BN_C(M, \text{poor}) = (F, \text{med.}) \wedge (P, \text{low}) \wedge (S, \text{large}).$$

After simplification we get the following approximations:

$$C_*(M, \text{poor}) = ((F, \text{med.}) \wedge (P, \text{med.})) \vee (F, \text{high}),$$

$$C^*(M, \text{poor}) = (F, \text{med.}) \vee (F, \text{high}).$$
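The simplified formulas are equivalent to the full ones only in the sense that they have the same meaning in S, i.e., they are satisfied by the same cars of Table 1. The following Python sketch checks this by evaluating each formula as a predicate on the rows; the function names are illustrative.

```python
# Table 1: car id -> attribute values (data copied from the table).
CARS = {
    1: {"F": "med.", "P": "med.", "S": "med.", "M": "poor"},
    2: {"F": "high", "P": "med.", "S": "large", "M": "poor"},
    3: {"F": "med.", "P": "low", "S": "large", "M": "poor"},
    4: {"F": "low", "P": "med.", "S": "med.", "M": "good"},
    5: {"F": "high", "P": "low", "S": "small", "M": "poor"},
    6: {"F": "med.", "P": "low", "S": "large", "M": "good"},
}


def meaning(formula):
    """Set of cars satisfying a formula given as a predicate on a row."""
    return {x for x, row in CARS.items() if formula(row)}


def is_row(r, f, p, s):
    """Conjunction (F, f) and (P, p) and (S, s)."""
    return r["F"] == f and r["P"] == p and r["S"] == s


def lower_full(r):
    return (is_row(r, "med.", "med.", "med.")
            or is_row(r, "high", "med.", "large")
            or is_row(r, "high", "low", "small"))


def lower_simplified(r):
    return (r["F"] == "med." and r["P"] == "med.") or r["F"] == "high"


def upper_full(r):
    return lower_full(r) or is_row(r, "med.", "low", "large")


def upper_simplified(r):
    return r["F"] == "med." or r["F"] == "high"


same_lower = meaning(lower_full) == meaning(lower_simplified)
same_upper = meaning(upper_full) == meaning(upper_simplified)
print(same_lower, same_upper)
# prints: True True
```

On a different data table the simplified formulas could pick out additional objects, so the simplification is tied to this particular information system.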


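The concepts of the lower and upper approximation of a decision allow us to
define the following decision rules:

$$C_*(\Psi) \to \Psi,$$

$$C^*(\Psi) \to \Psi,$$

$$BN_C(\Psi) \to \Psi.$$

For example, from the approximations given in the example above we get the
following decision rules:

$$((F, \text{med.}) \wedge (P, \text{med.})) \vee (F, \text{high}) \to (M, \text{poor}),$$

$$(F, \text{med.}) \vee (F, \text{high}) \to (M, \text{poor}),$$

$$(F, \text{med.}) \wedge (P, \text{low}) \wedge (S, \text{large}) \to (M, \text{poor}).$$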

From these definitions it follows that any decision $\Psi$ can be uniquely described
by the following two decision rules:

$$C_*(\Psi) \to \Psi,$$

$$BN_C(\Psi) \to \Psi.$$

From the above calculations we can get two decision rules

$$((F, \text{med.}) \wedge (P, \text{med.})) \vee (F, \text{high}) \to (M, \text{poor}),$$

$$(F, \text{med.}) \wedge (P, \text{low}) \wedge (S, \text{large}) \to (M, \text{poor}),$$

which are associated with the lower approximation and the boundary region
of the decision (M, poor), respectively, and describe the decision (M, poor).

Obviously we can get similar decision rules for the decision (M, good), which
are as follows:

$$(F, \text{low}) \to (M, \text{good}),$$

$$(F, \text{med.}) \wedge (P, \text{low}) \wedge (S, \text{large}) \to (M, \text{good}).$$

This coincides with the idea given by Ziarko [15] to represent decision tables
by means of three decision rules corresponding to the positive region, the boundary
region, and the negative region of a decision.

6 Decision Rules and Bayes’ Rules

If $\{\Phi_i \to \Psi\}_n$ is a set of decision rules satisfying condition (1), then the
well-known formula for total probability holds:

$$\pi_S(\Psi) = \sum_{i=1}^{n} \pi_S(\Psi|\Phi_i) \cdot \pi_S(\Phi_i). \qquad (2)$$

Moreover, for any decision rule $\Phi_j \to \Psi$ the following Bayes’ rule is valid:

$$\pi_S(\Phi_j|\Psi) = \frac{\pi_S(\Psi|\Phi_j) \cdot \pi_S(\Phi_j)}{\sum_{i=1}^{n} \pi_S(\Psi|\Phi_i) \cdot \pi_S(\Phi_i)}. \qquad (3)$$


That is, any decision table or any set of implications satisfying condition (1)
satisfies Bayes’ rule, without referring to prior and posterior probabilities,
which are fundamental in the Bayesian data analysis philosophy. Bayes’ rule in our case says
that if an implication $\Phi \to \Psi$ is true to the degree $\pi_S(\Psi|\Phi)$, then the implication
$\Psi \to \Phi$ is true to the degree $\pi_S(\Phi|\Psi)$.
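The reason no priors are needed can be made explicit with a short derivation from the cardinality definitions. This is a sketch, assuming that $\pi_S(\Phi)$ and $\pi_S(\Psi)$ denote the relative frequencies $card(\|\Phi\|_S)/card(U)$ and $card(\|\Psi\|_S)/card(U)$; the worked numbers use the rule $(F, \text{med.}) \wedge (P, \text{low}) \wedge (S, \text{large}) \to (M, \text{poor})$ from Table 1.

```latex
\begin{align*}
\pi_S(\Phi|\Psi)
  &= \frac{card(\|\Phi \wedge \Psi\|_S)}{card(\|\Psi\|_S)} \\
  &= \frac{card(\|\Phi \wedge \Psi\|_S)}{card(\|\Phi\|_S)}
     \cdot \frac{card(\|\Phi\|_S)/card(U)}{card(\|\Psi\|_S)/card(U)} \\
  &= \frac{\pi_S(\Psi|\Phi) \cdot \pi_S(\Phi)}{\pi_S(\Psi)} .
\end{align*}
% Worked instance for Table 1:
% \|\Phi\|_S = \{3, 6\}, \|\Psi\|_S = \{1, 2, 3, 5\}, \|\Phi \wedge \Psi\|_S = \{3\}
\[
\pi_S(\Phi|\Psi)
  = \frac{\tfrac{1}{2} \cdot \tfrac{2}{6}}{\tfrac{4}{6}}
  = \frac{1}{4} .
\]
```

Every quantity on the right-hand side is a count taken from the data table, so the identity holds for any decision table without further assumptions.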

This idea can be seen as a generalization of the modus tollens inference rule,
which says that if the implication $\Phi \to \Psi$ is true, so is the implication
${\sim}\Psi \to {\sim}\Phi$. For example, for the set of decision rules

$$((F, \text{med.}) \wedge (P, \text{med.})) \vee (F, \text{high}) \to (M, \text{poor}),$$

$$(F, \text{med.}) \wedge (P, \text{low}) \wedge (S, \text{large}) \to (M, \text{poor}),$$

$$(F, \text{low}) \to (M, \text{good}),$$

$$(F, \text{med.}) \wedge (P, \text{low}) \wedge (S, \text{large}) \to (M, \text{good}),$$

we get the values of certainty and coverage factors shown in Table 3.


Table 3. Initial decision rules

Rule      Decision  Certainty  Coverage
certain   poor      1          3/4
boundary  poor      1/2        1/4
certain   good      1          1/2
boundary  good      1/2        1/2

The above set of decision rules can be "reversed" as

$$(M, \text{poor}) \to ((F, \text{med.}) \wedge (P, \text{med.})) \vee (F, \text{high}),$$

$$(M, \text{poor}) \to (F, \text{med.}) \wedge (P, \text{low}) \wedge (S, \text{large}),$$

$$(M, \text{good}) \to (F, \text{low}),$$

$$(M, \text{good}) \to (F, \text{med.}) \wedge (P, \text{low}) \wedge (S, \text{large}).$$

Due  to  Bayes’  rule  the  certainty  and  coverage  factors  for  inverted  decision 
rules  are  mutually  exchanged  as  shown  in  Table  4  below. 


Table 4. Reversed decision rules

Rule      Decision  Certainty  Coverage
certain   poor      3/4        1
boundary  poor      1/4        1/2
certain   good      1/2        1
boundary  good      1/2        1/2

This property can be used to reason about data in a way similar to that
allowed by the modus tollens inference rule in classical logic.
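The exchange of certainty and coverage under reversal, and Bayes' rule itself, can be verified numerically for the four rules of this section. A minimal Python sketch, assuming $\pi_S$ of a formula is its relative frequency in Table 1; the predicate and function names are hypothetical.

```python
from fractions import Fraction

# Table 1: car id -> attribute values (data copied from the table).
CARS = {
    1: {"F": "med.", "P": "med.", "S": "med.", "M": "poor"},
    2: {"F": "high", "P": "med.", "S": "large", "M": "poor"},
    3: {"F": "med.", "P": "low", "S": "large", "M": "poor"},
    4: {"F": "low", "P": "med.", "S": "med.", "M": "good"},
    5: {"F": "high", "P": "low", "S": "small", "M": "poor"},
    6: {"F": "med.", "P": "low", "S": "large", "M": "good"},
}


def meaning(formula):
    return {x for x, row in CARS.items() if formula(row)}


def prob(formula):
    """Relative frequency of a formula in the table (assumed pi_S)."""
    return Fraction(len(meaning(formula)), len(CARS))


def certainty(cond, dec):
    """pi_S(dec | cond) as a ratio of cardinalities."""
    return Fraction(len(meaning(cond) & meaning(dec)), len(meaning(cond)))


def poor(r):
    return r["M"] == "poor"


def good(r):
    return r["M"] == "good"


def certain_poor(r):
    return (r["F"] == "med." and r["P"] == "med.") or r["F"] == "high"


def boundary(r):
    return r["F"] == "med." and r["P"] == "low" and r["S"] == "large"


def certain_good(r):
    return r["F"] == "low"


RULES = [
    ("certain", "poor", certain_poor, poor),
    ("boundary", "poor", boundary, poor),
    ("certain", "good", certain_good, good),
    ("boundary", "good", boundary, good),
]

rows = []
for name, label, cond, dec in RULES:
    cert = certainty(cond, dec)  # certainty of cond -> dec
    cov = certainty(dec, cond)   # coverage, i.e. certainty of dec -> cond
    # Bayes' rule: coverage = certainty * pi(cond) / pi(dec).
    assert cov == cert * prob(cond) / prob(dec)
    rows.append((name, label, cert, cov))
    print(name, label, cert, cov)
```

Reading each printed row left to right gives Table 3, and swapping its last two entries gives Table 4.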

7  Conclusions 

It is shown in this paper that any decision table satisfies Bayes’ rule. This
makes it possible to apply Bayes’ rule of inference without referring to prior and
posterior probabilities, inherently associated with the "classical" Bayesian inference
philosophy. From data tables one can extract decision rules, i.e., implications labelled by
certainty factors expressing their degree of truth. The factors can be computed
from data. Moreover, one can compute from data the coverage degrees expressing
the truth degrees of the "reverse" implications. This can be treated as a generalization
of the modus tollens inference rule.

In one form or another, decision processes play a pivotal role in systems
analysis. Decisions are based on information. More often than not,
decision-relevant information is a mixture of measurements and perceptions.

It is a long-standing tradition in science to deal with perceptions by converting
them into measurements. As is true of every tradition, a time comes when
the underlying assumptions cease to be beyond question.

A thesis advanced in our work is that closer analysis leads to the conclusion
that in most fields of science, and especially in systems analysis, conversion of
perceptions into measurements is, in many cases, infeasible, unrealistic or
counterproductive. The alternative is to develop a machinery for computation with
perceptions which exploits the vast computational power of modern computers.
In essence, this is the aim of the computational theory of perceptions (CTP).
Somewhat paradoxically, the source of inspiration for this theory is the remarkable
human capability to perform a wide variety of physical and mental tasks
without any measurements and any computations. Underlying this capability
is the brain's crucial ability to manipulate perceptions: perceptions of time,
distance, direction, speed, force, shape, color, similarity, likelihood, intent and
truth, among others.

The point of departure in the computational theory of perceptions, the first
stage in the reasoning process, is conversion of perceptions into propositions
expressed in a natural language, with a proposition viewed as a carrier of
information which provides an answer to a question. A key idea in CTP is that
the meaning of a proposition may be represented as a generalized constraint
on a variable. This idea forms the basis for what is called constraint-centered
semantics of natural languages (CSNL).

The second stage in CTP involves translation of propositions in the initial
data set into the constraint language GCL, resulting in a collection of
antecedent constraints which constitute the initial constraint set ICS. The third
stage involves goal-directed propagation of initial constraints augmented with
decision-relevant constraints induced by an external knowledge base. The goal is
a terminal constraint set in GCL which upon retranslation, the fourth and last
stage, yields the end result of the reasoning process.

Existing  theories,  especially  probability  theory,  decision  analysis  and  systems 
analysis,  lack  the  capability  to  operate  on  information  which  is  perception-based 
rather than measurement-based. The primary objective of the computational
theory  of  perceptions  is  to  add  such  capability  to  existing  theories  and  thereby 
enhance  their  ability  to  deal  with  real  world  problems  in  an  environment  of 
imprecision,  uncertainty  and  partial  truth. 


On  Text  Mining  Techniques  for  Personalization 


Charu  C.  Aggarwal  and  Philip  S.  Yu 
IBM  T.  J.  Watson  Research  Center,  Yorktown  Heights,  NY  10598 


Abstract. The popularity of the Web has made text mining techniques
for personalization an increasingly important research topic. We first
examine the problem of text mining for building categorization systems.
Three different approaches which can be used for building categorization
systems are discussed: classification, clustering and partial supervision.
We examine the advantages and disadvantages of each approach. Some
Web-specific enhancements are discussed. Applications of text mining
techniques to collaborative filtering are then examined. Specifically,
a content-based collaborative filtering approach is considered.


1  Introduction 

The  increased  amount  of  online  text  data  on  the  Web  has  led  to  the  need  for 
improved  text  mining  techniques  for  personalization.  In  this  paper  we  will  discuss 
two  important  applications: 

- Categorization Systems: In categorization systems, we wish to provide
the ability to classify documents into categories in an automated way.
These categories may either be pre-decided from a training data set or may
be generated using a clustering algorithm. Various tradeoffs will be discussed
in this paper. Some Web-specific extensions for categorization are examined.

- Collaborative Filtering Systems: Collaborative filtering systems [18, 12,
2] are very applicable to electronic commerce sites in which purchases made
by customers can be tracked. The record of purchases or Web pages browsed
may be used in order to determine like-minded peer groups and make
recommendations for individual customers based on the behavior of their peer
groups. In content based collaborative filtering methods [6], a content
characterization of the product or Web page is used in the peer group
formation to make these recommendations. Text mining techniques are very
effective in using these content characterizations in order to provide
recommendations.


2  Categorization  Systems 

Categorization systems have become increasingly important because of the need
to classify large online repositories in a structured way. This can be very
useful for personalization applications in which the analysis of textual material
browsed on an E-commerce site by an online customer is used in order to make
recommendations.

Categorization systems can be built with or without supervision from a
pre-existing set of classes in another taxonomy. Several tradeoffs are possible and
have been discussed in [1]. These tradeoffs are as follows:

- Unsupervised systems: In unsupervised systems, clustering methods are
used in order to create sets of classes. These classes are then used in order
to perform the recommendations. Examples of such systems include those
discussed in [9, 10]. Improved methods for text clustering have also been
discussed in [4, 9, 10, 17, 19, 21]. The advantage of this system is that the
same measures which are used for clustering may be used for categorization.
Thus, this system has 100% accuracy, though the actual quality of
categorization is dependent on the nature of the initial clustering. Such systems
may not be very useful for personalization because it is not possible to
control the range of categories that the system can address. Furthermore, it is
difficult to create effective fine-grained subject isolation using unsupervised
techniques.

- Supervised Systems: In supervised systems, a pre-existing sample of
documents with the associated classes is available in order to provide the
supervision to the categorization system. A training procedure is applied to this
sample which models the relationship between the training data and the set
of classes. Several text classifiers have recently been proposed [5, 7, 14, 15].
Although these methods seem to work well on structured collections such as
the US patent database or the Reuters data set, the systems do not work
well for heterogeneous collections of documents such as those on the Web.
This is primarily because of the varying style, authorship and vocabulary in
different documents. For example, it has been shown in [7] that a training
procedure on the Yahoo! taxonomy achieves only 32% accuracy, whereas
the same algorithm achieves much greater success (more than 66%) with
the US patent database and Reuters data sets. Clearly, text data on the
Web poses special problems in terms of fitting the training data into any
particular model.

- Partially Supervised Clustering: We have developed a new approach
to categorization systems, referred to as the partially supervised approach,
where an initial training data set is used in order to partially supervise the
creation of a new set of classes. This results in a categorization system in
which it is possible to have some control over the range of subjects that one
would like the categorization system to address, but with a precise automated
definition of how each cluster is defined. The definition of the clusters may
then be used for the categorization process. The details of such a
categorization system can be found in [1], which uses a projected clustering technique
[3] to handle high dimensional clustering, and has been shown to be more effective
than either purely supervised or unsupervised clustering. This is because the
supervision ensures that one is able to create reasonably fine-grained subject
isolation which is related to the original taxonomy. At the same time, the
system is well suited to automated categorization. We have used this system
to categorize Web pages for a personalized news feed.
It is possible to improve the accuracy of these systems on the Web further by
adding certain Web-specific extensions. One interesting method for performing
enhanced categorization is by using the information which is latent in hyperlinks
[8]. Web pages tend to link with one another based on a proximity in the general
subject areas which are discussed in each page. This information can be used
in order to improve classifier performance by examining the content of the Web
pages which are linked to by the current page. Such a method has been discussed
in [8].

3  Content  Based  Collaborative  Filtering  Systems 

In this section, we will discuss our work on an application of clustering to
provide a generalization of the collaborative filtering concept to combine it with
content based filtering [16], where recommendations are made on products with
similar characteristics to the products liked by a customer. This is referred to
as the content based collaborative filtering approach [6]. Content based
collaborative filtering systems are useful in providing personalized recommendations
at an E-commerce site. In such systems, a past history of customer behavior is
available, which may be used for making future recommendations for
individual customers. We also assume that a “content characterization” of products
is available in order to perform recommendations. These characterizations may
be (but are not restricted to) the text description of the products which are
available at the Web site. The key here is that the characterizations should be
such that they contain attributes (or textual words) which are highly correlated
with buying behavior. In this sense, using carefully defined content attributes
which are specific to the domain knowledge in question can be very useful for
making recommendations. For example, in an engine which recommends CDs,
the nature of the characterizations could be the singer name, music category,
composer etc., since all of these attributes are likely to be highly correlated with
buying behavior. On the other hand, if the only information available is the raw
textual description of the products, then it may be desirable to use some kind
of feature selection process in order to decide which words are most relevant to
the process of making recommendations.

We will now proceed to describe the overall process and method of our
approach for performing content-based collaborative filtering. This collaborative
filtering process consists of the following sequence of steps, all of which are
shown in Figure 1.

Fig. 1. Content Based Collaborative Filtering

(1) Feature Selection: It is possible that the initial characterization of the
products is quite noisy, and not all of the textual descriptions are directly
related to buying behavior. For example, stop words (commonly occurring
words in the language) in the description are unlikely to have much
connection with the buying pattern in the products. In order to perform the
feature selection, we perform the following process: we first create a
preliminary customer characterization by concatenating the text descriptions for
each product bought by the customer. Let the set of words in the lexicon
describing the products be indexed by $i \in \{1, \ldots\}$ and let the set of
customers for which buying behavior is available be indexed by $j \in \{1, \ldots, n\}$.
The frequency of presence of word $i$ in customer characterization $j$ is denoted
by $F(i, j)$. The fractional presence of a word $i$ for customer $j$ is denoted by
$P(i, j)$ and is defined as follows:


P{iJ)  = 


Pjhj) 

SjeAll  customers 


(1) 


Note that when the word $i = i_0$ is noisy in its distribution across the different products, then the values of $P(i_0, j)$ are likely to be similar for different values of $j$. The gini index for the word $i_0$ is denoted by $G(i_0)$ and is defined as follows:

$$G(i_0) = 1 - \sqrt{\sum_{j \in \text{All customers}} P(i_0, j)^2} \qquad (2)$$

When the word $i_0$ is noisy in its distribution across the different customers,
then the value of $G(i_0)$ is high. Thus, in order to pick the content which
is  most  discriminating  in  behavioral  patterns,  we  pick  the  words  with  the 
lowest  gini  index.  The  process  of  finding  the  words  with  the  lowest  gini  index 
is  indicated  in  Step  1  of  Figure  1. 

(2)  Refining  Product  Characterizations:  In  the  second  stage  of  the  proce¬ 
dure,  we  refine  the  product  characterizations  from  the  original  text  descrip¬ 
tion.  To  do  so,  we  prune  the  content  characterizations  of  each  product  by 
removing  those  features  or  words  which  have  high  gini  index. 

(3) Refining Customer Characterizations: In the third stage of the procedure, we improve the customer characterizations: for each customer, we concatenate the refined (pruned) content characterizations, derived in the previous step, of the products bought by that customer.

(4)  Clustering:  In  the  fourth  stage,  we  use  the  selected  features  in  order  to 
perform the clustering of the customers into peer groups. This clustering
can  either  be  done  using  unsupervised  methods,  or  by  supervision  from  a 
pre-existing  set  of  classes  of  products  such  that  the  classification  is  directly 
related  to  buying  behavior. 

(5) Making Recommendations: In the final stage, we make recommendations
for  the  different  sets  of  customers.  In  order  to  make  the  recommendations  for 
a  given  customer,  we  find  the  closest  sets  of  clusters  for  the  content  charac¬ 
terization  of  that  customer.  Finding  the  content  characterization  for  a  given 
customer  may  sometimes  be  a  little  tricky  in  that  a  weighted  concatenation 
of  the  content  characterizations  of  the  individual  products  bought  by  that 
customer  may  be  needed.  The  weighting  may  be  done  in  different  ways  by 
giving  greater  weightage  to  the  more  recent  set  of  products  bought  by  the 
customer.  The  set  of  entities  in  this  closest  set  of  clusters  forms  the  peer 
group.  The  buying  behavior  of  this  peer  group  is  used  in  order  to  make 
recommendations.  Specifically,  the  most  frequently  bought  products  in  this 
peer  group  may  be  used  as  the  recommendations.  Several  variations  of  the 
nature  of  queries  are  possible,  and  are  discussed  subsequently. 
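
The feature-selection step (1) can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the function names, the toy purchase data, and the whitespace tokenizer are assumptions, and the gini index is assumed to take the form $1 - \sqrt{\sum_j P(i, j)^2}$, which behaves as the text describes (high for words spread evenly over customers, low for discriminating words).

```python
from collections import Counter
from math import sqrt

def customer_word_counts(purchases, descriptions):
    """F(i, j): frequency of word i in the concatenated descriptions of
    the products bought by customer j."""
    counts = {}
    for customer, products in purchases.items():
        words = []
        for p in products:
            words.extend(descriptions[p].lower().split())
        counts[customer] = Counter(words)
    return counts

def gini_index(word, counts):
    """Assumed form: G(i) = 1 - sqrt(sum_j P(i, j)^2), where P(i, j) is the
    share of customer j in the total frequency of word i."""
    freqs = [c[word] for c in counts.values()]
    total = sum(freqs)
    if total == 0:
        return 1.0  # word never occurs: treat it as uninformative
    return 1.0 - sqrt(sum((f / total) ** 2 for f in freqs))

def select_words(counts, keep):
    """Return the `keep` words with the lowest gini index (ties broken alphabetically)."""
    vocab = set().union(*counts.values())
    return sorted(vocab, key=lambda w: (gini_index(w, counts), w))[:keep]

# Hypothetical toy data: three CDs and two customers.
descriptions = {
    "cd1": "the jazz album",
    "cd2": "the rock album",
    "cd3": "the jazz trio",
}
purchases = {"ann": ["cd1", "cd3"], "bob": ["cd2"]}
counts = customer_word_counts(purchases, descriptions)
selected = select_words(counts, 3)
```

On this toy data the words bought by only one customer ("jazz", "rock", "trio") receive the lowest index, while "the" and "album", which occur for both customers, score higher and would be pruned in step (2).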

We  have  implemented  these  approaches  in  a  content-based  mining  engine  for 
making  recommendations,  and  it  seems  to  provide  significantly  more  effective 
results  than  a  simple  clustering  engine  which  uses  only  the  identity  attributes  of 
the  products  in  order  to  do  the  clustering. 

Several  kinds  of  queries  may  be  resolved  using  such  a  system  by  using  minor 
variations  of  the  method  discussed  for  making  recommendations: 

(1) For a given set of products browsed/bought, find the best recommendation list.

(2) For a given customer and a set of products browsed/bought by him in the current session, find the best set of products for that customer.

(3)  For  a  given  customer,  find  the  best  set  of  products  for  that  customer. 

(4)  For  the  queries  (1),  (2),  and  (3)  above,  find  the  recommendation  list  out  of 
a  pre-specified  promotion  list. 

(5)  Find  the  closest  peers  for  a  given  customer. 

(6)  Find  the  profile  of  the  customers  who  will  like  a  product  the  most. 

Most  of  the  above  queries  (with  the  exception  of  (6))  can  be  solved  by  using 
a  different  content  characterization  for  the  customer,  and  using  this  content 
characterization  in  order  to  find  the  peer  group  for  the  customer.  For  the  case 
of  query  (6),  we  first  find  the  peer  group  for  the  content  characterization  of  the 
current  product,  and  then  find  the  dominant  profile  characteristics  of  this  group 
of  customers.  In  order  to  do  so,  the  quantitative  association  rule  method  [20] 
may  be  used. 
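
As a rough sketch of how queries (1) to (4) reduce to a peer-group lookup, the Python fragment below picks the cluster closest to a content characterization and returns the products bought most often by its members. The cosine similarity, the use of a single closest cluster, the data layout, and all names are assumptions made for illustration; the text does not fix these choices.

```python
from collections import Counter
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def recommend(profile, clusters, purchases, top_n=2, promotion=None):
    """Find the cluster whose centroid is closest to `profile` and return the
    products bought most often by its members (the peer group).
    If `promotion` is given, only products on that list are returned (query 4)."""
    best = max(clusters, key=lambda c: cosine(profile, c["centroid"]))
    bought = Counter(p for member in best["members"] for p in purchases[member])
    ranked = [p for p, _ in sorted(bought.items(), key=lambda kv: (-kv[1], kv[0]))]
    if promotion is not None:
        ranked = [p for p in ranked if p in promotion]
    return ranked[:top_n]

# Hypothetical clusters of customers with word-count centroids.
clusters = [
    {"centroid": Counter(jazz=3, trio=1), "members": ["ann", "cat"]},
    {"centroid": Counter(rock=2, metal=1), "members": ["bob"]},
]
purchases = {"ann": ["cd1", "cd3"], "cat": ["cd3", "cd4"], "bob": ["cd2"]}

# The profile may come from a session (query 1), a customer (query 3), or both (query 2).
profile = Counter(jazz=1)
top = recommend(profile, clusters, purchases)
promoted = recommend(profile, clusters, purchases, promotion={"cd4"})
```

Only the construction of `profile` changes between the query types; the peer-group lookup itself stays the same.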

Another  related  application  which  we  are  working  on  is  to  provide  user  pro¬ 
filing  on  Web  browsing  patterns  by  categorizing  the  Web  pages  browsed  by  each 
person  so  as  to  identify  the  categories  of  interests  to  a  person.  Once  the  user 
characterization  or  profile  is  built,  content-based  collaborative  filtering  can  then 
be  applied.  Here  we  have  used  the  categorization  system  based  on  the  partially 
supervised  clustering  approach  to  categorize  the  Web  pages.  Now  the  product 
characterization  is  replaced  by  the  Web  page  categorization.  The  user  catego¬ 
rization  is  the  concatenation  of  the  categories  of  Web  pages  browsed  by  a  user. 

4  Conclusions  and  Summary 

In  this  paper,  we  discussed  some  categorization  and  clustering  methods  based 
on  text  mining,  and  their  applications  to  content  based  collaborative  filtering 
systems.  With  the  recent  increase  in  the  popularity  of  the  World  Wide  Web  for 
electronic  commerce,  such  systems  are  very  useful  for  improving  the  efficiency 
of  target  marketing  techniques.  Specifically,  such  methods  may  be  very  useful  in 
performing  one-to-one  sales  promotions. 

Abstract. Rough mereology is a paradigm allowing for a synthesis of the main ideas of two potent paradigms for reasoning under uncertainty: fuzzy set theory and rough set theory. In this work, we demonstrate applications of rough mereology to the important theoretical ideas put forth by Lotfi Zadeh [9], [10]: Granularity of Knowledge and Computing with Words.

1 Introduction

We refer the reader to [2], [4], [5], [6], [7] for the basic notions of rough set theory and rough mereology. Here, we begin with the notion of a pre-granule of knowledge. Given either an information system $\mathcal{A} = (U, A)$ or a decision system $\mathcal{A} = (U, A, d)$, and a set $B \subseteq A$ of attributes, we define information sets of objects via $Inf_B(u) = \{(a, a(u)) : a \in B\}$ and we express the indiscernibility of objects as the identity of their information sets: $IND_B(u, w)$ is TRUE iff $Inf_B(u) = Inf_B(w)$ for any pair $u, w$ of objects in $U$. The (boolean) algebra generated over the set of atoms $U/IND_B$ by means of the set-theoretical operations of union, intersection and complement is said to be the $B$-algebra $CG(B)$ of pre-granules.
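
A small Python sketch may make the atoms of $CG(B)$ concrete. It computes information sets and groups objects with equal information sets into indiscernibility classes; the table layout (a dict of attribute dicts) and the toy values are assumptions for illustration only.

```python
from collections import defaultdict

def inf(table, u, B):
    """Information set Inf_B(u) = {(a, a(u)) : a in B}."""
    return frozenset((a, table[u][a]) for a in B)

def ind_classes(table, B):
    """Partition U / IND_B: objects with equal information sets share a class."""
    classes = defaultdict(set)
    for u in table:
        classes[inf(table, u, B)].add(u)
    return list(classes.values())

# Hypothetical information system with three objects and two attributes.
table = {
    "u1": {"color": "red", "size": "big"},
    "u2": {"color": "red", "size": "small"},
    "u3": {"color": "blue", "size": "big"},
}
by_color = ind_classes(table, ["color"])
by_both = ind_classes(table, ["color", "size"])
```

Every pre-granule in $CG(B)$ is then a union of some of the returned classes.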


1.1  Granules  of  Knowledge:  Rough  set  approach 

In the language of granules, we may express partial dependencies between sets $B, C$ of attributes by relating classes of $IND_B$ to classes of $IND_C$. We will call, accordingly, a $(B, C)$-granule any pair $(G, G')$ where $G \in CG(B)$ and $G' \in CG(C)$. Clearly, given a pre-granule $G \in CG(B)$, there exists a formula $\alpha_G$ (unique in DNF) of the form $\bigvee_i \bigwedge_j (a_{ij} = v_{ij})$ such that the meaning $[\alpha_G] = G$. Then, the granule $(G, G')$ may be represented logically as a pair $[\alpha_G, \alpha_{G'}]$ corresponding to the dependency rule $\alpha_G \Rightarrow \alpha_{G'}$ [2]. There are two characteristics of the granule $(G, G')$ important in applications to decision algorithms, viz. the characteristic whose values measure what part of $[G']$ is in $[G]$ (the 'strength' of the rule $\alpha_G \Rightarrow \alpha_{G'}$) and the characteristic whose values measure what part of $[G]$ is in $[G']$ (the 'strength of the support' for the rule $\alpha_G \Rightarrow \alpha_{G'}$).

A standard choice of an appropriate measure may be based on frequency count; the formal rendering is the standard rough inclusion function [3] defined for two sets $X, Y \subseteq U$ by the formula $\mu(X, Y) = \frac{card(X \cap Y)}{card(X)}$ when $X$ is non-empty and $\mu(X, Y) = 1$ otherwise.

To select sufficiently strong rules, we would set a threshold $tr$. We define then, in analogy with machine learning techniques, two characteristics:

($\rho$) $\rho(G, G') = \mu([G], [G'])$; ($\eta$) $\eta(G, G') = \mu([G'], [G])$

and we call an $(\eta, \rho)$-granule of knowledge any granule $(G, G')$ such that

(i) $\rho(G, G') \ge tr$; (ii) $\eta(G, G') \ge tr$.
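
The frequency-based measure and the threshold test can be sketched in Python as follows. The function names and the toy sets are hypothetical, meanings $[G]$ and $[G']$ are represented directly as Python sets, and reading the threshold comparison as non-strict is an assumption.

```python
def mu(X, Y):
    """Standard rough inclusion: card(X & Y) / card(X), and 1 when X is empty."""
    X, Y = set(X), set(Y)
    return len(X & Y) / len(X) if X else 1.0

def is_granule(G, G_prime, threshold):
    """Check that both characteristics of the pair (G, G') reach the threshold:
    rho = mu([G], [G']) and eta = mu([G'], [G])."""
    rho = mu(G, G_prime)
    eta = mu(G_prime, G)
    return rho >= threshold and eta >= threshold

# Hypothetical meanings of two pre-granules over U = {1, ..., 5}.
G = {1, 2, 3, 4}
G_prime = {2, 3, 4, 5}
```

Here both characteristics equal 3/4, so the pair passes a threshold of 0.7 and fails a threshold of 0.8.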

This logical model of granulation may not be adequate to practical demands: the relation $IND$ may be too rigid, and ways of relaxing it are among the most intensively studied topics [7]. Here, we propose to introduce a rough mereological approach to the granulation problem in which $IND$-classes are replaced with mereological classes, i.e. similarity classes.


2  Rough  mereology 

Rough mereology [4], [5], [8] has been proposed and studied as a means of clustering in a relational way. Formally, it defines a functor $\mu_r$ of being a part in degree at least $r$ for each $r \in [0, 1]$. Rough mereology may be introduced conveniently in the logical framework of ontology and mereology proposed by Stanislaw Lesniewski [1].

2.1  Mereology 

We begin with the notion of a part functor $pt$. This sets the meaning of "$X$ is a part of $Y$". We will use the notation of the Ontology of Lesniewski, $X \epsilon Y$ (read "$X$ is $Y$"), which replaces the standard notation of naive set theory as more convenient. The meaning of $X \epsilon Y$ is specified as:

$$X \epsilon Y \iff \exists Z. Z \epsilon X \wedge \forall Z.(Z \epsilon X \Rightarrow Z \epsilon Y) \wedge \forall U, W.(U \epsilon X \wedge W \epsilon X \Rightarrow U \epsilon W)$$

which means that $X$ is an individual, anything which is $X$ is $Y$, and $X$ is non-empty. The symbol $V$ denotes the universe and is defined via

$$X \epsilon V \iff \exists Y. X \epsilon Y$$

We rephrase basic axioms for $pt$.

(ME1) $X \epsilon pt(Y) \Rightarrow \exists Z. Z \epsilon X \wedge X \epsilon V \wedge Y \epsilon V$;

(ME2) $X \epsilon pt(Y) \wedge Y \epsilon pt(Z) \Rightarrow X \epsilon pt(Z)$;

(ME3) $non(X \epsilon pt(X))$.

Then $X \epsilon pt(Y)$ means that the individual denoted $X$ is a proper part (in virtue of (ME3)) of the individual denoted $Y$. The concept of an improper part is reflected in the notion of an element $el$; this is a name-forming functor defined as follows: $X \epsilon el(Y) \iff X \epsilon pt(Y) \vee X = Y$.

We will require that the following inference rule be valid.

(ME4) $\forall T.(T \epsilon el(X) \Rightarrow \exists W. W \epsilon el(T) \wedge W \epsilon el(Y)) \Rightarrow X \epsilon el(Y)$.

The  notion  of  a  collective  class  i.e.  of  an  object  composed  of  other  objects  which 
are  its  elements  may  be  introduced  at  this  point;  this  is  effected  by  means  of  a 
functor  Kl  defined  as  follows. 

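$$X \epsilon Kl(Y) \iff \exists Z. Z \epsilon Y \wedge \forall Z.(Z \epsilon Y \Rightarrow Z \epsilon el(X)) \wedge \forall Z.(Z \epsilon el(X) \Rightarrow \exists U, W. U \epsilon Y \wedge W \epsilon el(U) \wedge W \epsilon el(Z)).$$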

Thus,  the  class  consists  of  all  objects  which  have  an  element  in  common  with 
an  object  with  the  class  defining  property. 

The notion of a class is subjected to the following restrictions:

(ME5) $X \epsilon Kl(Y) \wedge Z \epsilon Kl(Y) \Rightarrow Z \epsilon X$ ($Kl(Y)$ is an individual);

(ME6) $\exists Z. Z \epsilon Y \Rightarrow \exists Z. Z \epsilon Kl(Y)$ (the class exists for each non-empty name).

Thus,  Kl{Y)  is  defined  for  any  non-empty  name  Y  and  Kl{Y)  is  an  individual 
object. 

2.2  Rough  Mereology:  first  notions 

Rough  Mereology  has  been  proposed  and  studied  in  [4],  [5],  [8]  as  a  vehicle  for 
reasoning  under  uncertainty. 

The following is a list of basic axiomatic postulates for Rough Mereology. We introduce a graded family $\mu_r$, where $r \in [0, 1]$ is a real number from the unit interval, of functors which would satisfy the following ($\mu_r(X)$ is a new name derived from $X$ via $\mu_r$):

(RM1) $X \epsilon \mu_1(Y) \iff X \epsilon el(Y)$;

(RM2) $X \epsilon \mu_1(Y) \Rightarrow \forall Z.(Z \epsilon \mu_r(X) \Rightarrow Z \epsilon \mu_r(Y))$;

(RM3) $X = Y \wedge X \epsilon \mu_r(Z) \Rightarrow Y \epsilon \mu_r(Z)$;

(RM4) $X \epsilon \mu_r(Y) \wedge s \le r \Rightarrow X \epsilon \mu_s(Y)$.

One may have as an archetypical rough mereological predicate the rough membership function of Pawlak and Skowron [3] defined in an extended form as:

$$X \epsilon \mu_r(Y) \iff \frac{card(X \cap Y)}{card(X)} \ge r \text{ in case } X \text{ is non-empty, and } 1 \ge r \text{ otherwise,}$$

where $X, Y$ are (either exact or rough) subsets in the universe $U$ of an information/decision system $(U, A)$.

2.3 Rough mereological component of granulation

The functors $\mu_r$ may enter our discussion of a granule and of the relation $Gr$ in each of the following ways:

1. Concerning the definitions ($\eta$), ($\rho$) of the functions $\eta, \rho$, we may replace in them the rough membership function $\mu$ with a function $\mu_r$ possibly better suited to a given context:

($\rho$) $G \epsilon \mu_\rho(G')$; ($\eta$) $G' \epsilon \mu_\eta(G)$.

2. The process of clustering may be described in terms of the class functor of mereology:

($\rho$) $G \epsilon el(Kl_\rho(G'))$; ($\eta$) $G' \epsilon el(Kl_\eta(G))$

where $Kl_r(X)$ is the class of objects $Z$ satisfying $Z \epsilon \mu_r(X)$. We will adhere to this means of clustering and we denote in the sequel $Kl_r(X)$ with the symbol $gr(X, r)$, read as the granule of radius $r$ centered at $X$.
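
To illustrate the granule of radius $r$, the following Python sketch collects, from a finite family of candidate objects, those that are part of $X$ in degree at least $r$. Taking the frequency-based rough inclusion as the underlying measure, and representing the class $Kl_r(X)$ simply by the list of its generating members, are assumptions made for the sake of the example; all names are hypothetical.

```python
def mu(Z, X):
    """Frequency-based rough inclusion of Z in X (1 when Z is empty)."""
    Z, X = set(Z), set(X)
    return len(Z & X) / len(Z) if Z else 1.0

def granule(X, r, objects):
    """Names of the objects Z with Z in mu_r(X), i.e. mu(Z, X) >= r.
    These generate the granule gr(X, r) of radius r centered at X."""
    return sorted(name for name, Z in objects.items() if mu(Z, X) >= r)

# Hypothetical family of objects, each given as a subset of a universe.
objects = {"a": {1, 2}, "b": {2, 3}, "c": {5, 6}, "d": {1, 2, 3, 9}}
X = {1, 2, 3}

tight = granule(X, 1.0, objects)
loose = granule(X, 0.7, objects)
```

Lowering the radius can only enlarge the granule, which mirrors the monotonicity postulate (RM4).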

3  Adaptive  Calculus  of  Granules  for  Synthesis  in 
Distributed  Systems 

We  construct  a  mechanism  for  transferring  granules  of  knowledge  among  agents 
by  means  of  transfer  functions  induced  by  rough  mereological  connectives  ex¬ 
tracted  from  their  respective  information  systems  [5]. 

We  now  recall  basic  ingredients  of  our  scheme  of  agents  [5],  [8]. 


3.1  Distributed  Systems  of  Agents 

We assume that a pair $(Inv, Ag)$ is given where $Inv$ is an inventory of elementary objects and $Ag$ is a set of intelligent computing units called, for short, agents. We consider an agent $ag \in Ag$. The agent $ag$ is endowed with tools for reasoning and communicating about objects in its scope; these tools are defined by components of the agent label.

The label of the agent $ag$ is the tuple

$$lab(ag) = (\mathcal{A}(ag), \mu_r(ag), L(ag), Link(ag), O(ag), St(ag), Unc\text{-}rel(ag), Unc\text{-}rule(ag), Dec\text{-}rule(ag))$$

where

1. $\mathcal{A}(ag)$ is an information system of the agent $ag$.

2. $\mu_r(ag)$ is a functor of part in a degree at $ag$.

3. $L(ag)$ is a set of unary predicates (properties of objects) in a predicate calculus interpreted in the set $U(ag)$.

4. $St(ag) = \{st(ag)_1, \ldots, st(ag)_n\} \subset U(ag)$ is the set of standard objects at $ag$.

5. $Link(ag)$ is a collection of strings of the form $ag_1 ag_2 \ldots ag_k ag$ which are elementary teams of agents; we denote by the symbol $Link$ the union of the family $\{Link(ag) : ag \in Ag\}$.

6. $O(ag)$ is the set of operations at $ag$; any $o \in O(ag)$ is a mapping of the Cartesian product $U(ag_1) \times U(ag_2) \times \cdots \times U(ag_k)$ into the universe $U(ag)$ where $ag_1 ag_2 \ldots ag_k ag \in Link(ag)$.

7. $Unc\text{-}rel(ag)$ is the set of parameterized uncertainty relations $\rho_i = \rho_i(o_i(ag), st(ag_1)_i, st(ag_2)_i, \ldots, st(ag_k)_i, st(ag))$ where $ag_1 ag_2 \ldots ag_k ag \in Link(ag)$, $o_i(ag) \in O(ag)$ are such that

$$\rho_i((x_1, \varepsilon_1), (x_2, \varepsilon_2), \ldots, (x_k, \varepsilon_k), (x, \varepsilon))$$

holds for $x_1 \in U(ag_1), x_2 \in U(ag_2), \ldots, x_k \in U(ag_k)$ and $\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_k, \varepsilon \in [0, 1]$ iff $x_j \epsilon \mu(ag_j)_{\varepsilon_j}(st(ag_j))$ for $j = 1, 2, \ldots, k$ and $x \epsilon \mu(ag)_{\varepsilon}(st(ag))$, where $o_i(st(ag_1), st(ag_2), \ldots, st(ag_k)) = st(ag)$ and $o_i(x_1, x_2, \ldots, x_k) = x$.

Uncertainty relations express the agent's knowledge about relationships among uncertainty coefficients of the agent $ag$ and uncertainty coefficients of its children.

8. $Unc\text{-}rule(ag)$ is the set of uncertainty rules $f_j$ where $f_j : [0, 1]^k \to [0, 1]$ is a function which has the property that if $x_1 \in U(ag_1), x_2 \in U(ag_2), \ldots, x_k \in U(ag_k)$ satisfy the conditions

$$x_i \epsilon \mu(ag_i)_{\varepsilon(ag_i)}(st(ag_i)) \text{ for } i = 1, 2, \ldots, k$$

then $\mu_o(o_j(x_1, x_2, \ldots, x_k), st(ag)) \ge f_j(\varepsilon(ag_1), \varepsilon(ag_2), \ldots, \varepsilon(ag_k))$, where all parameters are as in 7.

9. $Dec\text{-}rule(ag)$ is a set of decomposition rules $dec\text{-}rule_i$ and

$$(\Phi(ag_1), \Phi(ag_2), \ldots, \Phi(ag_k), \Phi(ag)) \in dec\text{-}rule_i$$

(where $\Phi(ag_1) \in L(ag_1), \Phi(ag_2) \in L(ag_2), \ldots, \Phi(ag_k) \in L(ag_k), \Phi(ag) \in L(ag)$ and $ag_1 ag_2 \ldots ag_k ag \in Link(ag)$) iff there exists a collection of standards $st(ag_1), st(ag_2), \ldots, st(ag_k), st(ag)$ with the properties that $o_j(st(ag_1), st(ag_2), \ldots, st(ag_k)) = st(ag)$, $st(ag_i)$ satisfies $\Phi(ag_i)$ for $i = 1, 2, \ldots, k$ and $st(ag)$ satisfies $\Phi(ag)$. Decomposition rules are decomposition schemes in the sense that they describe the standard $st(ag)$ and the standards $st(ag_1), \ldots, st(ag_k)$ from which the standard $st(ag)$ is assembled under $o_i$ in terms of predicates which these standards satisfy.


3.2  Approximate  Synthesis  of  Complex  Objects 

The process of synthesis of a complex object (e.g. signal, action) by the above defined scheme of agents consists, in our approach, of two communication stages, viz. the top-down communication/negotiation process and the bottom-up communication/assembling process. We outline the two stages here in the language of approximate formulae.

Approximate logic of synthesis. We assume for simplicity that the relation $ag' \le ag$, which holds for agents $ag', ag \in Ag$ iff there exists a string $ag_1 ag_2 \ldots ag_k ag \in Link(ag)$ with $ag' = ag_i$ for some $i \le k$, orders the set $Ag$ into a tree. We also assume that $O(ag) = \{o(ag)\}$ for $ag \in Ag$, i.e. each agent has a unique assembling operation.

We recall a logic $L(Ag)$ [5], [8] in which we can express global properties of the synthesis process.

Elementary formulae of $L(Ag)$ are of the form $\langle st(ag), \Phi(ag), \varepsilon(ag) \rangle$ where $st(ag) \in St(ag)$, $\Phi(ag) \in L(ag)$, $\varepsilon(ag) \in [0, 1]$ for any $ag \in Ag$. Formulae of $L(Ag)$ form the smallest extension of the set of elementary formulae closed under propositional connectives $\vee, \wedge, \neg$ and under the modal operators $\Box, \Diamond$.

The meaning of a formula $\Phi(ag)$ is defined classically as the set $[\Phi(ag)] = \{u \in U(ag) : u \text{ has the property } \Phi(ag)\}$; we express satisfaction by $u \vdash \Phi(ag)$. For $x \in U(ag)$, we say that $x$ satisfies $\langle st(ag), \Phi(ag), \varepsilon(ag) \rangle$, in symbols:

$$x \vdash \langle st(ag), \Phi(ag), \varepsilon(ag) \rangle,$$

iff (i) $st(ag) \vdash \Phi(ag)$; and (ii) $x \epsilon \mu(ag)_{\varepsilon(ag)}(st(ag))$.

We extend satisfaction over formulae by recursion as usual.

By a selection over $Ag$ we mean a function $sel$ which assigns to each agent $ag$ an object $sel(ag) \in U(ag)$. For two selections $sel, sel'$ we say that $sel$ induces $sel'$, in symbols $sel \to_{Ag} sel'$, when $sel(ag) = sel'(ag)$ for any $ag \in Leaf(Ag)$ and $sel'(ag) = o(ag)(sel'(ag_1), sel'(ag_2), \ldots, sel'(ag_k))$ for any $ag_1 ag_2 \ldots ag_k ag \in Link$.

We extend the satisfiability predicate $\vdash$ to selections: for an elementary formula $\langle st(ag), \Phi(ag), \varepsilon(ag) \rangle$, we let $sel \vdash \langle st(ag), \Phi(ag), \varepsilon(ag) \rangle$ iff $sel(ag) \vdash \langle st(ag), \Phi(ag), \varepsilon(ag) \rangle$.

We now let $sel \vdash \Diamond \langle st(ag), \Phi(ag), \varepsilon(ag) \rangle$ when there exists a selection $sel'$ satisfying the conditions: $sel \to_{Ag} sel'$; $sel' \vdash \langle st(ag), \Phi(ag), \varepsilon(ag) \rangle$.

In terms of $L(Ag)$ it is possible to express the problem of synthesis of an approximate solution to the problem posed to $Ag$. We denote by $head(Ag)$ the root of the tree $(Ag, \le)$ and by $Leaf(Ag)$ the set of leaf-agents in $Ag$. In the process of top-down communication, a requirement $\Phi$ received by the scheme from an external source (which may be called a customer) is decomposed into approximate specifications of the form $\langle st(ag), \Phi(ag), \varepsilon(ag) \rangle$ for any agent $ag$ of the scheme. The decomposition process is initiated at the agent $head(Ag)$ and propagated down the tree. We are able now to formulate the synthesis problem.

Synthesis problem. Given a formula

$$\alpha : \langle st(head(Ag)), \Phi(head(Ag)), \varepsilon(head(Ag)) \rangle$$

find a selection $sel$ over the tree $(Ag, \le)$ with the property $sel \vdash \alpha$.

A solution to the synthesis problem with a given formula $\alpha$ is found by negotiations among the agents based on uncertainty rules, and their successful result can be expressed by a top-down recursion in the tree $(Ag, \le)$ as follows: given a local team $ag_1 ag_2 \ldots ag_k ag$ with the formula $\langle st(ag), \Phi(ag), \varepsilon(ag) \rangle$ already chosen, it is sufficient that each agent $ag_i$ choose a standard $st(ag_i) \in U(ag_i)$, a formula $\Phi(ag_i) \in L(ag_i)$ and a coefficient $\varepsilon(ag_i) \in [0, 1]$ such that

(iii) $(\Phi(ag_1), \Phi(ag_2), \ldots, \Phi(ag_k), \Phi(ag)) \in Dec\text{-}rule(ag)$ with standards $st(ag), st(ag_1), \ldots, st(ag_k)$;

(iv) $f(\varepsilon(ag_1), \ldots, \varepsilon(ag_k)) \ge \varepsilon(ag)$, where $f$ satisfies $unc\text{-}rule(ag)$ with $st(ag), st(ag_1), \ldots, st(ag_k)$ and $\varepsilon(ag_1), \ldots, \varepsilon(ag_k), \varepsilon(ag)$.

For a formula $\alpha$, we call an $\alpha$-scheme an assignment of a formula $\alpha(ag) : \langle st(ag), \Phi(ag), \varepsilon(ag) \rangle$ to each $ag \in Ag$ in such a manner that (iii), (iv) above are satisfied and $\alpha(head(Ag))$ is $\langle st(head(Ag)), \Phi(head(Ag)), \varepsilon(head(Ag)) \rangle$. We denote this scheme with the symbol

$$sch(\langle st(head(Ag)), \Phi(head(Ag)), \varepsilon(head(Ag)) \rangle).$$

We say that a selection $sel$ is compatible with a scheme $sch(\langle st(head(Ag)), \Phi(head(Ag)), \varepsilon(head(Ag)) \rangle)$ in case $sel(ag) \epsilon \mu(ag)_{\varepsilon(ag)}(st(ag))$ for each leaf agent $ag \in Ag$.

The goal of negotiations can be summarized now as follows.

Proposition 1. Given a formula $\langle st(head(Ag)), \Phi(head(Ag)), \varepsilon(head(Ag)) \rangle$: if a selection $sel$ is compatible with a scheme $sch(\langle st(head(Ag)), \Phi(head(Ag)), \varepsilon(head(Ag)) \rangle)$ then $sel \vdash \Diamond \langle st(head(Ag)), \Phi(head(Ag)), \varepsilon(head(Ag)) \rangle$.
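
The numerical side of an $\alpha$-scheme, condition (iv), can be checked by a simple recursion over the agent tree. The Python sketch below is an illustration under stated assumptions: the tree, the coefficients, and the choice of `min` and a product as uncertainty rules are all hypothetical, and condition (iii) on decomposition rules is not modelled.

```python
from math import prod

def check_scheme(tree, eps, rules, root):
    """Return True if f(eps of the children) >= eps(ag) holds at every
    non-leaf agent of the subtree rooted at `root` (condition (iv))."""
    children = tree.get(root, [])
    if not children:
        return True  # leaf agents impose no condition of this kind
    f = rules[root]
    if f([eps[c] for c in children]) < eps[root]:
        return False
    return all(check_scheme(tree, eps, rules, c) for c in children)

# Hypothetical agent tree: head -> (ag1, ag2), ag1 -> (leaf1, leaf2).
tree = {"head": ["ag1", "ag2"], "ag1": ["leaf1", "leaf2"]}
eps = {"head": 0.6, "ag1": 0.7, "ag2": 0.8, "leaf1": 0.9, "leaf2": 0.75}

min_rules = {"head": min, "ag1": min}     # assumed rule: weakest child decides
prod_rules = {"head": prod, "ag1": prod}  # assumed rule: degrees multiply

ok_with_min = check_scheme(tree, eps, min_rules, "head")
ok_with_prod = check_scheme(tree, eps, prod_rules, "head")
```

With the `min` rule the chosen coefficients form an admissible scheme; with the product rule the head agent would have to lower its coefficient or ask its children for higher ones, which is the kind of adjustment the negotiation stage is meant to perform.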

4 Calculi of Granules in (Inv, Ag)

We construct for a given system $(Ag, \le)$ of agents a granulation relation $Gr(ag)$ for any agent $ag \in Ag$ depending on parameters $\varepsilon(ag), \mu(ag)$. We may have various levels of granulation and a fortiori various levels of knowledge compression about synthesis; we address here a simple specimen.


4.1  Calculi  of  pre-granules 

For a standard $st(ag)$ and a value $\varepsilon(ag)$, we denote by $gr(st(ag), \varepsilon(ag))$ the pre-granule $Kl_{\varepsilon(ag)}(st(ag))$; then, a granule selector $sel_g$ is a map which for each $ag \in Ag$ chooses a granule $sel_g(ag) = gr(st(ag), \varepsilon(ag))$.

We say that $gr(st(ag), \varepsilon(ag))$ satisfies a formula $\alpha : \langle st(ag), \Phi(ag), \varepsilon(ag) \rangle$ (in symbols, $gr(st(ag), \varepsilon(ag)) \vdash \alpha$) in case $st(ag) \vdash \Phi(ag)$. Given $ag_1 ag_2 \ldots ag_k ag \in Link$ and a formula $\langle st(ag), \Phi(ag), \varepsilon(ag) \rangle$ along with $f$ satisfying $unc\text{-}rule(ag)$ with $st(ag), st(ag_1), \ldots, st(ag_k)$ and $\varepsilon(ag_1), \ldots, \varepsilon(ag_k), \varepsilon(ag)$, $o(ag)$ maps the product $\times_i \, gr(st(ag_i), \varepsilon(ag_i))$ into $gr(st(ag), \varepsilon(ag))$. Composing these mappings along the tree $(Ag, \le)$, we define a mapping $prod_{Ag}$ which maps any set $\{gr(st(ag), \varepsilon(ag)) : ag \in Leaf(Ag)\}$ into a granule $gr(st(head(Ag)), \varepsilon(head(Ag)))$. We say that a selection $sel_g$ is compatible with a scheme $sch(\langle st(head(Ag)), \Phi(head(Ag)), \varepsilon(head(Ag)) \rangle)$ if

$$sel_g(ag_i) = gr(st(ag_i), \varepsilon'(ag_i)) \, \epsilon \, el(gr(st(ag_i), \varepsilon(ag_i)))$$

for each leaf agent $ag_i$. As

$$prod_{Ag}(sel_g) \vdash \langle st(head(Ag)), \Phi(head(Ag)), \varepsilon(head(Ag)) \rangle$$

we have the pre-granule counterpart of Proposition 1.

Proposition 2. Given a formula $\langle st(head(Ag)), \Phi(head(Ag)), \varepsilon(head(Ag)) \rangle$: if a selection $sel_g$ is compatible with a scheme $sch(\langle st(head(Ag)), \Phi(head(Ag)), \varepsilon(head(Ag)) \rangle)$ then $sel_g \vdash \Diamond \langle st(head(Ag)), \Phi(head(Ag)), \varepsilon(head(Ag)) \rangle$.

5  Associated  grammar  systems:  a  granular  semantics  for 
Computing  with  Words 

We are now in a position to present our discussion in the form of a grammar system related to the multi-agent tree $(Ag, \le)$ [6]. With each agent $ag \in Ag$, we associate a grammar $\Gamma(ag) = (N(ag), T(ag), P(ag))$. To this end, we assume that a finite set $\Xi(ag) \subset [0, 1]$ is selected for each agent $ag$. We let $N(ag) = \{(s_{\Phi(ag)}, t_{\varepsilon(ag)}) : \Phi(ag) \in L(ag), \varepsilon(ag) \in \Xi(ag)\}$ where $s_{\Phi(ag)}$ is a non-terminal symbol corresponding in a one-to-one way to the formula $\Phi(ag)$ and similarly $t_{\varepsilon(ag)}$ corresponds to $\varepsilon(ag)$. The set of terminal symbols $T(ag)$ is defined for each non-leaf agent $ag$ by letting

$$T(ag) = \bigcup \{\{(s_{\Phi(ag_i)}, t_{\varepsilon(ag_i)}) : \Phi(ag_i) \in L(ag_i), \varepsilon(ag_i) \in \Xi(ag_i)\} : i = 1, 2, \ldots, k\}$$

where $ag_1 ag_2 \ldots ag_k ag \in Link$.

The set of productions $P(ag)$ contains productions of the form

(v) $(s_{\Phi(ag)}, t_{\varepsilon(ag)}) \to (s_{\Phi(ag_1)}, t_{\varepsilon(ag_1)})(s_{\Phi(ag_2)}, t_{\varepsilon(ag_2)}) \ldots (s_{\Phi(ag_k)}, t_{\varepsilon(ag_k)})$

where $(o(ag), \Phi(ag_1), \Phi(ag_2), \ldots, \Phi(ag_k), \Phi(ag), st(ag_1), st(ag_2), \ldots, st(ag_k), st(ag), \varepsilon(ag), \varepsilon(ag_1), \varepsilon(ag_2), \ldots, \varepsilon(ag_k))$ satisfy (iii), (iv).

We define a grammar system $\Gamma = (T, \{\Gamma(ag) : ag \in Ag, ag \text{ non-leaf or } ag = Input\}, S)$ by choosing the set $T$ of terminals as follows:

(vi) $T = \bigcup \{\{(s_{\Phi(ag)}, t_{\varepsilon(ag)}) : \Phi(ag) \in L(ag), \varepsilon(ag) \in \Xi(ag)\} : ag \in Leaf(Ag)\}$;

and introducing an additional agent $Input$ with non-terminal symbol $S$, terminal symbols of $Input$ being non-terminal symbols of $head(Ag)$ and productions of $Input$ of the form:

(vii) $S \to (s_{\Phi(head(Ag))}, t_{\varepsilon(head(Ag))})$

where $\Phi(head(Ag)) \in L(head(Ag))$, $\varepsilon(head(Ag)) \in \Xi(head(Ag))$.

The meaning of $S$ is that it codes an approximate specification (requirement) for an object; productions of $Input$ code specifications for approximate solutions in the language of the agent $head(Ag)$. Subsequent rewritings produce terminal strings of the form

(viii) $(s_{\Phi(ag_1)}, t_{\varepsilon(ag_1)})(s_{\Phi(ag_2)}, t_{\varepsilon(ag_2)}) \ldots (s_{\Phi(ag_k)}, t_{\varepsilon(ag_k)})$

where $ag_1, ag_2, \ldots, ag_k$ are all leaf agents in $Ag$.


We have

Proposition 3. Suppose $(s_{\Phi(ag_1)}, t_{\varepsilon(ag_1)})(s_{\Phi(ag_2)}, t_{\varepsilon(ag_2)}) \ldots (s_{\Phi(ag_k)}, t_{\varepsilon(ag_k)})$ is of the form (viii) and it is obtained from $S \to (s_{\Phi(head(Ag))}, t_{\varepsilon(head(Ag))})$ by subsequent rewriting by means of productions in $\Gamma$. Then given any selection $sel$ with $sel(ag_i) \epsilon \mu(ag_i)_{\varepsilon(ag_i)}(st(ag_i))$ for $i = 1, 2, \ldots, k$ we have

$$sel \vdash \Diamond \langle st(head(Ag)), \Phi(head(Ag)), \varepsilon(head(Ag)) \rangle.$$

Let us observe that each of the grammars $\Gamma(ag)$ is a linear context-free grammar. We have thus linear languages $L(\Gamma)$ which provide a semantics for Computing with Words.
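
The rewriting process behind Proposition 3 can be illustrated with a toy derivation in Python. The symbols, formula names, and coefficients below are hypothetical, each agent is given a single production for simplicity, and a symbol counts as terminal simply when no production rewrites it.

```python
def derive(start, productions):
    """Rewrite `start` with the given productions until only symbols without
    a production (terminals) remain; return the resulting string as a list."""
    string = [start]
    while any(sym in productions for sym in string):
        rewritten = []
        for sym in string:
            # A symbol with a production is replaced by its right-hand side.
            rewritten.extend(productions.get(sym, [sym]))
        string = rewritten
    return string

# Symbols are (agent, formula name, coefficient) triples standing for the
# pairs (s_Phi(ag), t_eps(ag)); "S" is the start symbol of the Input agent.
head = ("head", "phi_h", 0.6)
ag1 = ("ag1", "phi_1", 0.7)
ag2 = ("ag2", "phi_2", 0.8)
leaf1 = ("leaf1", "phi_a", 0.9)
leaf2 = ("leaf2", "phi_b", 0.75)

productions = {
    "S": [head],          # production of the form (vii)
    head: [ag1, ag2],     # productions of the form (v)
    ag1: [leaf1, leaf2],
}

terminal_string = derive("S", productions)
```

The derived string lists one specification per leaf agent (here `ag2` has no children and therefore acts as a leaf), which is the shape of the terminal strings described in (viii).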

Abstract. One of the most difficult problems in modeling medical reasoning is to model a procedure for diagnosis about complications. In medical contexts, a patient sometimes suffers from several diseases and has complicated symptoms, which makes a differential diagnosis very difficult. For example, in the domain of headache, a patient suffering from migraine (a vascular disease) may also suffer from muscle contraction headache (a muscular disease). In this case, symptoms specific to vascular diseases will be observed with those specific to muscular ones. Since one of the essential processes in diagnosis of headache is discrimination between vascular and muscular diseases¹, simple rules will not work to rule out one of the two groups. However, medical experts do not have this problem and conclude both diseases. In this paper, three models for reasoning about complications are introduced and modeled by using characterization and rough set model. This clear representation suggests that this model should be used by medical experts implicitly.


1  Introduction 

One  of  the  most  difficult  problems  in  modeling  medical  reasoning  is  to  model  a 
procedure  for  diagnosis  about  complications.  In  medical  contexts,  a  patient  some¬ 
times  suffers  from  several  diseases  and  has  complicated  symptoms,  which  makes 
a  differential  diagnosis  very  difficult.  For  example,  in  the  domain  of  headache,  a 
patient suffering from migraine (a vascular disease) may also suffer from muscle contraction headache (a muscular disease). In this case, symptoms specific to vascular diseases will be observed with those specific to muscular ones. Since one of the essential processes in diagnosis of headache is discrimination between vascular and muscular diseases¹, simple rules will not work to rule out one of the two groups. However, medical experts do not have this problem and conclude both diseases.

In  this  paper,  three  models  for  reasoning  about  complications  are  introduced 
and  modeled  by  using  characterization  and  rough  set  model.  This  clear  repre¬ 
sentation  suggests  that  this  model  should  be  used  by  medical  experts  implicitly. 

¹ The second step of differential diagnosis will be to discriminate diseases within each group [2].


The  paper  is  organized  as  follows:  Section  2  discusses  reasoning  about  compli¬ 
cations.  Section  3  shows  the  definitions  of  statistical  measures  used  for  modeling 
rules  based  on  rough  set  model.  Section  4  presents  a  rough  set  model  of  com¬ 
plications  and  an  algorithm  for  induction  of  plausible  diagnostic  rules.  Section 
5  gives  an  algorithm  for  induction  of  reasoning  about  complications.  Section  6 
discusses  related  work.  Finally,  Section  7  concludes  this  paper. 

2  Reasoning  about  Complications 

Medical experts look for possible complications in the
following cases. (1) A patient has several symptoms which cannot be explained
by the final diagnostic candidates. In this case, each diagnostic candidate belongs
to a different disease category and the candidates do not intersect each other (independent
type). (2) A patient has several symptoms which are shared by several diseases,
each of which belongs to a different disease category, and which are important
for confirming some of these diseases. In this case, the diagnostic candidates
have some intersection with respect to the characterization of diseases (boundary
type). (3) A patient has several symptoms which suggest that his disease will
progress into a more specific one in the near future. In this case, the specific
disease belongs to a subcategory of the disease (subcategory type).

3  Probabilistic  Rules 

3.1  Accuracy  and  Coverage 

In the subsequent sections, we adopt the following notation, which is introduced
in [7].

Let $U$ denote a nonempty, finite set called the universe and $A$ denote a
nonempty, finite set of attributes, i.e., $a : U \to V_a$ for $a \in A$, where $V_a$ is called
the domain of $a$. Then, a decision table is defined as an information
system, $\mathcal{A} = (U, A \cup \{d\})$.

The atomic formulas over $B \subseteq A \cup \{d\}$ and $V$ are expressions of the form
$[a = v]$, called descriptors over $B$, where $a \in B$ and $v \in V_a$. The set $F(B, V)$ of
formulas over $B$ is the least set containing all atomic formulas over $B$ and closed
with respect to disjunction, conjunction and negation.

For each $f \in F(B, V)$, $f_A$ denotes the meaning of $f$ in $\mathcal{A}$, i.e., the set of all
objects in $U$ with property $f$, defined inductively as follows.

1. If $f$ is of the form $[a = v]$, then $f_A = \{s \in U \mid a(s) = v\}$.

2. $(f \wedge g)_A = f_A \cap g_A$; $(f \vee g)_A = f_A \cup g_A$; $(\neg f)_A = U - f_A$.

By the use of this framework, classification accuracy and coverage (true positive
rate) are defined as follows.

Definition 1.

Let $R$ and $D$ denote a formula in $F(B, V)$ and a set of objects which belong to
a decision $d$. Classification accuracy and coverage (true positive rate) for $R \to d$
are defined as:

$$\alpha_R(D) = \frac{|R_A \cap D|}{|R_A|} \; (= P(D|R)), \quad \text{and}$$

$$\kappa_R(D) = \frac{|R_A \cap D|}{|D|} \; (= P(R|D)),$$

where $|A|$ denotes the cardinality of a set $A$, $\alpha_R(D)$ denotes the classification
accuracy of $R$ as to classification of $D$, and $\kappa_R(D)$ denotes the coverage, or true
positive rate, of $R$ to $D$.

It is notable that these two measures are equal to conditional probabilities:
accuracy is the probability of $D$ under the condition of $R$, and coverage is that of $R$
under the condition of $D$. It is also notable that $\alpha_R(D)$ measures the degree of
the sufficiency of a proposition, $R \to D$, and that $\kappa_R(D)$ measures the degree of
its necessity.^

For example, if $\alpha_R(D)$ is equal to 1.0, then $R \to D$ is true. On the other
hand, if $\kappa_R(D)$ is equal to 1.0, then $D \to R$ is true. Thus, if both measures are
1.0, then $R \leftrightarrow D$.

Also, Pawlak recently reported a Bayesian relation between accuracy and
coverage [5]:

$$\kappa_R(D) P(D) = P(R|D) P(D) = P(R, D) = P(R) P(D|R) = \alpha_R(D) P(R).$$

This relation also suggests that a priori and a posteriori probabilities can be
easily and automatically calculated from a database.
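
The definitions above can be checked on a small table. The following is a minimal sketch in Python, assuming a hypothetical five-row decision table; the attribute names, values, and class labels are illustrative and do not come from the paper. It computes accuracy and coverage for one descriptor and confirms the Bayesian relation with exact fractions. The functions assume that the descriptor and the decision class are both satisfied by at least one object.

```python
from fractions import Fraction

# Hypothetical toy decision table (not from the paper): one dict per object of U.
TABLE = [
    {"nausea": "no",  "location": "whole",   "d": "muscular"},
    {"nausea": "no",  "location": "whole",   "d": "muscular"},
    {"nausea": "yes", "location": "lateral", "d": "vascular"},
    {"nausea": "yes", "location": "whole",   "d": "vascular"},
    {"nausea": "no",  "location": "lateral", "d": "vascular"},
]

def meaning(table, attr, value):
    """Indices of the objects that satisfy the descriptor [attr = value]."""
    return {i for i, row in enumerate(table) if row[attr] == value}

def accuracy(table, attr, value, decision):
    """alpha_R(D) = |R_A & D| / |R_A| for R = [attr = value]."""
    r = meaning(table, attr, value)
    d = meaning(table, "d", decision)
    return Fraction(len(r & d), len(r))

def coverage(table, attr, value, decision):
    """kappa_R(D) = |R_A & D| / |D| for R = [attr = value]."""
    r = meaning(table, attr, value)
    d = meaning(table, "d", decision)
    return Fraction(len(r & d), len(d))

alpha = accuracy(TABLE, "nausea", "yes", "vascular")
kappa = coverage(TABLE, "nausea", "yes", "vascular")
p_r = Fraction(len(meaning(TABLE, "nausea", "yes")), len(TABLE))
p_d = Fraction(len(meaning(TABLE, "d", "vascular")), len(TABLE))

# Bayesian relation: kappa_R(D) P(D) = alpha_R(D) P(R)
print(alpha, kappa, kappa * p_d == alpha * p_r)
```

Exact fractions are used so that the equality of the two products is not blurred by floating-point rounding.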


3.2  Definition  of  Rules 

By  the  use  of  accuracy  and  coverage,  a  probabilistic  rule  is  defined  as: 

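$$R \xrightarrow{\alpha, \kappa} d \quad \text{s.t.} \quad R = \wedge_j \vee_k [a_j = v_k], \quad \alpha_R(D) > \delta_\alpha, \quad \kappa_R(D) > \delta_\kappa.$$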

This rule is a kind of probabilistic proposition with two statistical measures,
which is an extension of Ziarko's variable precision rough set model (VPRS) [12].^

It  is  also  notable  that  both  a  positive  rule  and  a  negative  rule  are  defined  as 
special  cases  of  this  rule,  as  shown  in  the  next  subsections. 

^ These characteristics follow from the formal definitions of accuracy and coverage. In this
paper, these measures are important not only from the viewpoint of propositional
logic, but also from that of modelling medical experts' reasoning, as shown later.

^ This probabilistic rule is also a kind of Rough Modus Ponens [4].

4 Rough Set Model of Complications

4.1  Definition  of  Characterization  Set 

In order to model these three reasoning types, a statistical measure, coverage
$\kappa_R(D)$, plays an important role; it is the conditional probability
of a condition ($R$) under the decision $D$, $P(R|D)$.

Let us define a characterization set of $D$, denoted by $L_{\delta_\kappa}(D)$, as a set, each
element of which is an elementary attribute-value pair $R$ with coverage
larger than a given threshold $\delta_\kappa$. That is,

$$L_{\delta_\kappa}(D) = \{[a_i = v_j] \mid \kappa_{[a_i = v_j]}(D) > \delta_\kappa\}.$$

Then, according to the descriptions in Section 2, three models of reasoning about
complications are defined as below:

1. Independent type: $L_{\delta_\kappa}(D_i) \cap L_{\delta_\kappa}(D_j) = \emptyset$,

2. Boundary type: $L_{\delta_\kappa}(D_i) \cap L_{\delta_\kappa}(D_j) \neq \emptyset$, and

3. Subcategory type: $L_{\delta_\kappa}(D_i) \subseteq L_{\delta_\kappa}(D_j)$.

These three definitions correspond to the negative region, boundary region,
and positive region [2], respectively, if the set of all elementary attribute-value
pairs is taken as the universe of discourse. Thus, reasoning about
complications is closely related to the fundamental concepts of rough set
theory.
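
The three types can be told apart with ordinary set operations. The sketch below, in Python, assumes that the coverage of each attribute-value pair has already been computed for each class; the coverage values, pair names, and the threshold 0.5 are hypothetical. The comparison is strict, following the wording "larger than" in the definition; whether a non-strict comparison is intended for a threshold of 1.0 is left open here.

```python
def characterization_set(coverages, threshold):
    """L_delta(D): attribute-value pairs whose coverage exceeds the threshold.

    `coverages` maps (attribute, value) -> kappa for a single class D.
    """
    return {pair for pair, kappa in coverages.items() if kappa > threshold}

def complication_type(l_i, l_j):
    """Classify the relation between two characterization sets."""
    if not (l_i & l_j):
        return "independent"   # empty intersection
    if l_i <= l_j:
        return "subcategory"   # L(D_i) is contained in L(D_j)
    return "boundary"          # nonempty intersection without inclusion

# Hypothetical coverage values for three classes.
cov_d1 = {("nausea", "yes"): 0.9, ("location", "lateral"): 0.3}
cov_d2 = {("nausea", "yes"): 0.8, ("location", "lateral"): 0.7, ("jolt", "yes"): 0.2}
cov_d3 = {("nausea", "no"): 0.9}

l1 = characterization_set(cov_d1, 0.5)
l2 = characterization_set(cov_d2, 0.5)
l3 = characterization_set(cov_d3, 0.5)

print(complication_type(l1, l2), complication_type(l2, l1), complication_type(l1, l3))
```

Inclusion is tested before the boundary case because a subset relation also has a nonempty intersection; the order of the arguments matters for the subcategory type.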


4.2  Characterization  as  Exclusive  Rules 

The characteristics of a characterization set depend on the value of $\delta_\kappa$. If the threshold
is set to 1.0, then a characterization set is equivalent to the set of attributes in
exclusive rules [8]. That is, the meaning of each attribute-value pair in $L_{1.0}(D)$
covers all the examples of $D$. In other words, an example which does not
satisfy every pair in $L_{1.0}(D)$ cannot belong to the class $D$.

Construction of rules based on $L_{1.0}$ is discussed in Subsection 4.4 and
can also be found in [9, 10]. The differences between these two papers are the
following: the former focuses on the independent type and subcategory type for $L_{1.0}$
to represent diagnostic rules and applies them to the discovery of decision
rules in medical databases, whereas the latter focuses on the boundary
type for $L_{1.0}$ and applies it to the discovery of plausible rules.


4.3  Rough  Inclusion 


Concerning the boundary type, it is important to consider the similarities between
classes. In order to measure the similarity between classes with respect to
characterization, we introduce a rough inclusion measure $\mu$, which is defined as
follows.

$$\mu(S, T) = \frac{|S \cap T|}{|S|}.$$

It is notable that if $S \subseteq T$, then $\mu(S, T) = 1.0$, which shows that this relation
extends subset and superset relations. This measure was introduced by Polkowski
and Skowron in their study on rough mereology [6]. Whereas rough mereology
was first applied to distributed information systems, its essential idea is rough inclusion:
rough inclusion focuses on set inclusion to characterize a hierarchical
structure based on the relation between a subset and a superset. Thus, applying
rough inclusion to capture the relations between classes is equivalent to
constructing a rough hierarchical structure between classes, which is also closely
related to the information granulation proposed by Zadeh [11].
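
As a small illustration of the measure, the following Python sketch computes the rough inclusion of one finite set in another. The function name and the example sets are hypothetical; the measure is undefined for an empty first argument, which the sketch reports as an error.

```python
def rough_inclusion(s, t):
    """mu(S, T) = |S & T| / |S|, defined only for nonempty S."""
    if not s:
        raise ValueError("mu(S, T) is undefined for an empty S")
    return len(s & t) / len(s)

print(rough_inclusion({1, 2}, {1, 2, 3}))     # S is a subset of T
print(rough_inclusion({1, 2, 3, 4}, {1, 2}))  # partial overlap
print(rough_inclusion({1, 2}, {3, 4}))        # disjoint sets
```

The two extreme values, 1.0 for inclusion and 0.0 for disjoint sets, bracket the partially overlapping case, and the measure is not symmetric in its arguments.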


4.4  Rule  Induction  Algorithm 

Algorithms for induction of plausible diagnostic rules (boundary type) are given
in Figs. 1 to 3, which are fully discussed in [10]. Since the subcategory type and
independent type can be viewed as special cases of the boundary type with respect to
rough inclusion, rule induction algorithms for the subcategory type and independent
type are obtained if the threshold for $\mu$ is set to 1.0 and 0.0, respectively.

Rule induction (Fig. 1) consists of the following three procedures. First, the
characterization of each given class, a list of attribute-value pairs the supporting
set of which covers all the samples of the class, is extracted from databases and
the classes are classified into several groups with respect to the characterization.
Then, two kinds of sub-rules, rules discriminating between the groups and rules
classifying each class within a group, are induced (Fig. 2). Finally, those two parts
are integrated into one rule for each decision attribute (Fig. 3).

5  Induction  of  Complication  Rules 

A simple version of complication rules was formerly called a disease image, which
had a very simple form in earlier versions [8]. A disease image is constructed from
$L_{0.0}(D)$ as the disjunctive formula of all the members of this characterization set.
In this paper, complication rules are defined so as to detect complications more
effectively. This rule is used to detect complications of multiple diseases and is acquired
from all the possible manifestations of the disease. By the use of this rule, the
manifestations which cannot be explained by the conclusions will be checked,
which suggest complications of other diseases. These rules consist of two parts:
one is a collection of symptoms, and the other is a rule for each symptom,
which are important for detection of complications.

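1. $R \to d$ s.t. $R = \vee R_{jk} = \vee_j \vee_k [a_j = v_k]$,

2. $R_{jk} \xrightarrow{\alpha, \kappa} d_i$ s.t. $R_{jk} = [a_j = v_k]$,
$\alpha_{R_{jk}}(D_i) > \eta_\alpha$, $\kappa_{R_{jk}}(D_i) > \eta_\kappa$,

where $\eta$ denotes a threshold for $\alpha$ and $\kappa$.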

The first part can be viewed as rules whose attribute-value pairs belong to
$U - L_{0.0}(D_i)$ for each class $D_i$. On the other hand, the second part can be viewed

procedure Rule Induction (Total Process);
var
  i : integer; M, L, R : List;
  L_D : List;  /* A list of all classes */
begin
  Calculate α_R(D_i) and κ_R(D_i)
    for each elementary relation R and each class D_i;
  Make a list L(D_i) = {R | κ_R(D_i) = 1.0}
    for each class D_i;
  while (L_D ≠ ∅) do
  begin
    D_i := first(L_D); M := L_D - D_i;
    while (M ≠ ∅) do
    begin
      D_j := first(M);
      if (μ(L(D_j), L(D_i)) < δ_μ)
        then L_2(D_i) := L_2(D_i) + {D_j};
      M := M - D_j;
    end
    Make a new decision attribute D'_i for L_2(D_i);
    L_D := L_D - D_i;
  end
  Construct a new table T_2(D_i) for L_2(D_i);
  Construct a new table T(D'_i)
    for each decision attribute D'_i;
  Induce classification rules R_2 for each L_2(D);  /* Fig. 2 */
  Store Rules into a List R(D);
  Induce classification rules R_d
    for each D' in T(D');  /* Fig. 2 */
  Store Rules into a List R(D') (= R(L_2(D_i)));
  Integrate R_2 and R_d into a rule R_D;  /* Fig. 3 */
end {Rule Induction};


Fig.  1.  An  Algorithm  for  Rule  Induction 


as rules whose attribute-value pairs come from $L_{\eta_\kappa}(D_j)$ such that $i \neq j$. Thus,
complication rules can be constructed from the overlapping region of $U - L_{0.0}(D_i)$
and $L_{\eta_\kappa}(D_j)$.
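
The overlapping region described here can be sketched with set operations. The Python fragment below is only an illustration under stated assumptions: the universe of attribute-value pairs, the coverage values, and the threshold `eta` are hypothetical, and pairs missing from a coverage table are treated as having coverage zero. It returns the pairs that are never observed with one class but are sufficiently covered by another, which are the candidates for signalling a complication.

```python
def characterization_set(coverages, threshold):
    """Pairs whose coverage is larger than the threshold (strict comparison)."""
    return {pair for pair, kappa in coverages.items() if kappa > threshold}

def complication_candidates(all_pairs, cov_i, cov_j, eta):
    """Overlap of (U - L_0.0(D_i)) and L_eta(D_j)."""
    unexplained_by_i = all_pairs - characterization_set(cov_i, 0.0)
    return unexplained_by_i & characterization_set(cov_j, eta)

# Hypothetical universe of elementary attribute-value pairs and coverages.
all_pairs = {("nausea", "yes"), ("location", "lateral"), ("jolt", "yes")}
cov_i = {("nausea", "yes"): 0.9}
cov_j = {("nausea", "yes"): 0.8, ("location", "lateral"): 0.7, ("jolt", "yes"): 0.2}

print(complication_candidates(all_pairs, cov_i, cov_j, eta=0.5))
```

In this toy setting only the lateral location qualifies: it never occurs with the first class and its coverage for the second class exceeds the threshold.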

6  Discussion 

6.1  Conflict  Analysis 

It is easy to see the relations of the independent type and subcategory type. While
the independent type suggests different mechanisms of diseases, the subcategory type

procedure Induction of Classification Rules;
var
  i : integer; M, L_i : List;
begin
  L_1 := L_er;  /* L_er: List of Elementary Relations */
  i := 1; M := {};
  for i := 1 to n do  /* n: Total number of attributes */
  begin
    while (L_i ≠ {}) do
    begin
      Select one pair R = ∧[a_i = v_j] from L_i;
      L_i := L_i - {R};
      if (α_R(D) > δ_α) and (κ_R(D) > δ_κ)
        then do S_ir := S_ir + {R};  /* Include R as Inclusive Rule */
        else M := M + {R};
    end
    L_{i+1} := (A list of the whole combination of
                the conjunction formulae in M);
  end
end {Induction of Classification Rules};


Fig.  2.  An  Algorithm  for  Classification  Rules 


suggests the same etiology. The difficult one is the boundary type, where several symptoms
overlap in each $L_{\delta_\kappa}(D)$. In this case, relations between $L_{\delta_\kappa}(D_i)$
and $L_{\delta_\kappa}(D_j)$ should be examined.

One approach to these complicated relations is conflict analysis [3]. In this
analysis, several concepts which share several attribute-value pairs are analyzed
with respect to a qualitative similarity measure that can be viewed as an extension
of rough inclusion. It will be our future work to introduce this methodology to
analyze relations of the boundary type and to develop induction algorithms for
these relations.


6.2  Granular  Fuzzy  Partition 

Coverage is also closely related to granular fuzzy partition, which was introduced
by Lin [1] in the context of granular computing.

Since coverage $\kappa_R(D)$ is equivalent to a conditional probability, $P(R|D)$, this
measure will satisfy the condition on partition of unity, called -partition
(If we select a suitable partition of universe, then this partition will satisfy
$\sum_R \kappa_R(D) = 1.0$.) Also, from the definition of coverage, it is also equivalent
to the counting measure for $|[x]_R|$ since $|D|$ is constant in a given universe




procedure Rule Integration;
var
  i : integer; M, L_2 : List; R(D_i) : List;  /* A list of rules for D_i */
  L_D : List;  /* A list of all classes */
begin
  while (L_D ≠ ∅) do
  begin
    D_i := first(L_D); M := L_2(D_i);
    Select one rule R' → D'_i from R(L_2(D_i));
    while (M ≠ ∅) do
    begin
      D_j := first(M);
      Select one rule R → d_j for D_j;
      Integrate two rules: R ∧ R' → d_j;
      M := M - {D_j};
    end
    L_D := L_D - D_i;
  end
end {Rule Integration};


Fig.  3.  An  Algorithm  for  Rule  Integration 


U.  Thus,  this  measure  satisfies  a  “nice  context”,  which  holds: 


Hence, all these features show that a partition generated by coverage is a kind
of granular fuzzy partition [1]. This result also shows that the characterization
by coverage is closely related to information granulation.

From this point of view, the usage of coverage for characterization and
grouping of classes means that we focus on some specific partition generated
by attribute-value pairs, the coverage of which is equal to 1.0, and that we
consider the second-order relations between these pairs. It is also notable that if
the second-order relation makes a partition, as shown in the example above, then
this structure can also be viewed as a granular fuzzy partition.

However, rough inclusion and accuracy do not always satisfy the nice context.
It will be our future work to examine the formal characteristics of coverage
(and also accuracy) and rough inclusion from the viewpoint of granular fuzzy
sets.

Abstract. This paper proposes rough genetic algorithms based on the
notion of rough values. A rough value is defined using an upper and a
lower bound. Rough values can be used to effectively represent a range or
set of values. A gene in a rough genetic algorithm can be represented using
a rough value. The paper describes how this generalization facilitates
development of new genetic operators and evaluation measures. The use
of rough genetic algorithms is demonstrated using a simple document
retrieval application.


1  Introduction 

Rough set theory [9] provides an important complement to fuzzy set theory [14]
in the field of soft computing. Rough computing has proved itself useful in the
development of a variety of intelligent information systems [10, 11]. Recently,
Lingras [4-7] proposed the concept of rough patterns, which are based on the
notion of rough values. A rough value consists of an upper and a lower bound.
A rough value can be used to effectively represent a range or set of values for
variables such as daily temperature, rainfall, hourly traffic volume, and daily
financial indicators. Many of the mathematical operations on rough values are
borrowed from the interval algebra [1]. The interval algebra provides an ability
to deal with an interval of numbers. Allen [1] described how the interval algebra
can be used for temporal reasoning. There are several computational issues associated
with temporal reasoning based on the interval algebra. van Beek [12] used
a subset of the interval algebra that leads to computationally feasible temporal
reasoning. A rough value is a special case of an interval, where only the upper
and lower bounds of the interval are used in the computations. A rough pattern
consisting of rough values has several semantic and computational advantages in
many analytical applications. Rough patterns are primarily used with numerical
tools such as neural networks and genetic algorithms, while the interval algebra
is used for logical reasoning.

Lingras [7] used an analogy with the heap sorting algorithm and object-oriented
programming to stress the importance of rough computing. Any computation
done using rough values can also be rewritten in the form of conventional
numbers. However, rough values provide a better semantic interpretation of results,
in terms of upper and lower bounds. Moreover, some of the numeric computations
cannot be conceptualized without explicitly discussing the upper and
lower bound framework [7].

This  paper  proposes  a  generalization  of  genetic  algorithms  based  on  rough 
values.  The  proposed  rough  genetic  algorithms  (RGAs)  can  complement  the 
existing  tools  developed  in  rough  computing. 

The paper provides the definitions of basic building blocks of rough genetic
algorithms, such as rough genes and rough chromosomes. The conventional genes
and chromosomes are shown to be special cases of their rough equivalents. Rough
extension of GAs facilitates development of new genetic operators and evaluators
in addition to the conventional ones. Two new rough genetic operators, called
union and intersection, are defined in this paper. In addition, the paper also
introduces a measure called precision to describe the information contained in a
rough chromosome. A distance measure is defined that can be useful for quantifying
the dissimilarity between two rough chromosomes. Both precision and
distance measures can play an important role in evaluating a rough genetic population.
A simple example is also provided to demonstrate practical applications
of the proposed RGAs.

Section 2 provides a brief review of genetic algorithms. Section 3 proposes
the notion of rough genetic algorithms and the associated definitions. New rough
genetic operators and evaluation measures are also defined in Section 3. Section
4 contains a simple document retrieval example to illustrate the use of rough
genetic algorithms. Summary and conclusions appear in Section 5.

2  Brief  Review  of  Genetic  Algorithms 

The origin of Genetic Algorithms (GAs) is attributed to Holland's [3] work
on cellular automata. There has been significant interest in GAs over the last
two decades. The range of applications of GAs includes such diverse areas as
job shop scheduling, training neural nets, image feature extraction, and image
feature identification [2]. This section contains some of the basic concepts of
genetic algorithms as described in [2].

A genetic algorithm is a search process that follows the principles of evolution
through natural selection. The domain knowledge is represented using a candidate
solution called an organism. Typically, an organism is a single chromosome
represented as a vector of length $n$:

$$c = (c_i \mid 1 \le i \le n), \qquad (1)$$

where $c_i$ is called a gene.

A group of organisms is called a population. Successive populations are called
generations. A generational GA starts from an initial generation $G(0)$, and for each
generation $G(t)$ generates a new generation $G(t+1)$ using genetic operators such
as mutation and crossover. The mutation operator creates new chromosomes by
changing values of one or more genes at random. The crossover operator joins segments
of two or more chromosomes to generate a new chromosome. An abstract view
of a generational GA is given in Fig. 1.

Genetic Algorithm:

generate initial population, G(0);
evaluate G(0);

for (t = 1; solution is not found; t++)
    generate G(t) using G(t - 1);
    evaluate G(t);


Fig.  1.  Abstract  view  of  a  generational  genetic  algorithm 
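
The abstract loop of Fig. 1 can be written as a short Python sketch. Everything specific in it is an assumption made for illustration: the integer chromosomes, the fitness function, the rule that keeps the better half of each generation, and the mutation that shifts a value by one at random. Crossover is omitted to keep the example small.

```python
import random

def evolve(initial, fitness, make_next, is_solution, max_generations=100):
    """Generational loop: evaluate G(t), then stop or derive G(t+1) from it."""
    population = list(initial)                                   # G(0)
    for t in range(max_generations + 1):
        ranked = sorted(population, key=fitness, reverse=True)   # evaluate G(t)
        if is_solution(ranked[0]) or t == max_generations:
            return ranked[0], t
        population = make_next(ranked)                           # generate G(t+1)

rng = random.Random(0)  # fixed seed so that runs are repeatable

def fitness(x):
    """Hypothetical fitness: closeness of an integer to the target value 3."""
    return -(x - 3) ** 2

def make_next(ranked):
    """Keep the better half and add one randomly mutated copy of each survivor."""
    survivors = ranked[: len(ranked) // 2]
    children = [x + rng.choice((-1, 1)) for x in survivors]
    return survivors + children

best, generations = evolve([10, -5, 7, 0], fitness, make_next, lambda x: x == 3)
print(best, generations)
```

Because the survivors are carried over unchanged, the best fitness never decreases from one generation to the next in this sketch.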


3  Definition  of  Rough  Genetic  Algorithms 

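In a rough pattern, the value of each variable is specified using lower and upper
bounds:

$$x = (\underline{x}, \overline{x}), \qquad (2)$$

where $\underline{x}$ is the lower bound and $\overline{x}$ is the upper bound of $x$. A conventional pattern
can be easily represented as a rough pattern by specifying both the lower and
upper bounds to be equal to the value of the variable. The rough values can be
added as:

$$x + y = (\underline{x}, \overline{x}) + (\underline{y}, \overline{y}) = (\underline{x} + \underline{y}, \overline{x} + \overline{y}), \qquad (3)$$

where $x$ and $y$ are rough values given by pairs $(\underline{x}, \overline{x})$ and $(\underline{y}, \overline{y})$, respectively. A
rough value $x$ can be multiplied by a number $c$ as:

$$c \times x = c \times (\underline{x}, \overline{x}) = (c \times \underline{x}, c \times \overline{x}), \quad \text{if } c \ge 0,$$
$$c \times x = c \times (\underline{x}, \overline{x}) = (c \times \overline{x}, c \times \underline{x}), \quad \text{if } c < 0. \qquad (4)$$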

Note  that  these  operations  are  borrowed  from  the  conventional  interval  calculus. 
As  mentioned  before,  a  rough  value  is  used  to  represent  an  interval  or  a  set  of 
values,  where  only  the  lower  and  upper  bounds  are  considered  relevant  in  the 
computation. 
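
Eqs. (3) and (4) translate directly into a small value type. The Python sketch below is one possible rendering; the class name `RoughValue` and the method name `scale` are hypothetical choices, not names used in the paper.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RoughValue:
    """A rough value given by its lower and upper bounds."""
    lower: float
    upper: float

    def __add__(self, other):
        # Eq. (3): bounds are added componentwise.
        return RoughValue(self.lower + other.lower, self.upper + other.upper)

    def scale(self, c):
        # Eq. (4): a negative factor swaps the roles of the two bounds.
        if c >= 0:
            return RoughValue(c * self.lower, c * self.upper)
        return RoughValue(c * self.upper, c * self.lower)

x = RoughValue(1, 3)
y = RoughValue(2, 5)
print(x + y)        # bounds 3 and 8
print(x.scale(-2))  # bounds -6 and -2
```

Swapping the bounds for a negative factor keeps the lower bound no greater than the upper bound after scaling.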

A rough chromosome $r$ is a string of rough genes $r_i$:

$$r = (r_i \mid 1 \le i \le n). \qquad (5)$$

A rough gene $r_i$ can be viewed as a pair of conventional genes, one for the lower
bound called the lower gene ($\underline{r_i}$) and the other for the upper bound called the upper
gene ($\overline{r_i}$):

$$r_i = (\underline{r_i}, \overline{r_i}). \qquad (6)$$


Fig. 2 shows an example of a rough chromosome. The value of each rough
gene is the range for that variable. The use of a range means that the information
conveyed by a rough chromosome is not precise. Hence, an information measure
called precision, given by eq. (7), may be useful while evaluating the fitness of a
rough chromosome.

$$precision(r) = -\sum_{1 \le i \le n} \frac{\overline{r_i} - \underline{r_i}}{Range_{max}(r_i)}. \qquad (7)$$

Fig. 2. Rough chromosomes along with associated operators and functions


In eq. (7), $Range_{max}(r_i)$ is the length of the maximum allowable range for the value
of rough gene $r_i$.

In Fig. 2,

$$precision(r) = -\left( \frac{0.4 - 0.2}{1.0} + \frac{0.6 - 0.1}{1.0} + \frac{0.3 - 0.2}{1.0} + \frac{0.9 - 0.7}{1.0} \right) = -1.0,$$

assuming that the maximum range of each rough gene is $[0, 1]$.

Any conventional chromosome can be represented as a rough chromosome
as shown in Fig. 3. Therefore, rough chromosomes are a generalization of conventional
chromosomes. For a conventional chromosome $c$, $precision(c)$ has the
maximum possible value of zero.
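
A minimal Python sketch of the precision measure follows. It assumes that every gene shares the same maximum range, passed as the hypothetical parameter `max_range`, and that a chromosome is a list of (lower, upper) pairs. The bounds of `r` are the ones that appear in the worked calculations of this section; the conventional chromosome is an arbitrary illustration.

```python
def precision(chromosome, max_range=1.0):
    """Eq. (7): negative sum of the gene widths relative to the allowed range."""
    return -sum((upper - lower) / max_range for lower, upper in chromosome)

# Bounds of r as used in the worked examples of this section.
r = [(0.2, 0.4), (0.1, 0.6), (0.2, 0.3), (0.7, 0.9)]

# A conventional chromosome has equal lower and upper bounds.
conventional = [(0.6, 0.6), (0.9, 0.9), (0.4, 0.4), (0.1, 0.1)]

print(round(precision(r), 2), precision(conventional) == 0)
```

Wider genes make the value more negative, so a conventional chromosome, whose genes have zero width, attains the maximum.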

New generations of rough chromosomes can be created using the conventional
mutation and crossover operators. However, the mutation operator should make
sure that $\overline{r_i} \ge \underline{r_i}$. Similarly, during the crossover a rough chromosome should be
split only at the boundary of a rough gene, i.e., a rough gene should be treated
as atomic.

In  addition  to  the  conventional  genetic  operators,  the  structure  of  rough  genes 
enables  us  to  define  two  new  genetic  operators  called  union  and  intersection.  Let 


Fig. 3. Conventional chromosome and its rough equivalent


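$r = (r_i \mid 1 \le i \le n)$ and $s = (s_i \mid 1 \le i \le n)$ be two rough chromosomes
defined as strings of rough genes $r_i$ and $s_i$, respectively. The union operator,
denoted by the familiar symbol $\cup$, is given as follows:

$$r \cup s = (r_i \cup s_i \mid 1 \le i \le n), \text{ where } r_i \cup s_i = \left( \min(\underline{r_i}, \underline{s_i}), \max(\overline{r_i}, \overline{s_i}) \right). \qquad (8)$$

The intersection operator, denoted by the symbol $\cap$, is given as follows:

$$r \cap s = (r_i \cap s_i \mid 1 \le i \le n), \text{ where } r_i \cap s_i = \left( \min\left( \min(\overline{r_i}, \overline{s_i}), \max(\underline{r_i}, \underline{s_i}) \right), \max\left( \min(\overline{r_i}, \overline{s_i}), \max(\underline{r_i}, \underline{s_i}) \right) \right). \qquad (9)$$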

Fig.  2  illustrates  the  union  and  intersection  operators. 
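
The two operators can be sketched in Python as follows, assuming chromosomes are lists of (lower, upper) pairs of equal length. The bounds of `r` and `s` are the ones that appear in the worked distance calculation of this section; the function names are hypothetical.

```python
def union(r, s):
    """Eq. (8): smallest lower bound and largest upper bound, gene by gene."""
    return [(min(rl, sl), max(ru, su)) for (rl, ru), (sl, su) in zip(r, s)]

def intersection(r, s):
    """Eq. (9): the two inner bounds, ordered so that lower <= upper."""
    result = []
    for (rl, ru), (sl, su) in zip(r, s):
        smaller_upper = min(ru, su)
        larger_lower = max(rl, sl)
        result.append((min(smaller_upper, larger_lower),
                       max(smaller_upper, larger_lower)))
    return result

r = [(0.2, 0.4), (0.1, 0.6), (0.2, 0.3), (0.7, 0.9)]
s = [(0.3, 0.5), (0.3, 0.4), (0.5, 0.8), (0.6, 0.7)]

print(union(r, s))
print(intersection(r, s))
```

For the third gene the two ranges do not overlap, and eq. (9) then yields the gap between them; the outer min and max only guarantee that the result is a valid rough gene.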

A measure of similarity or dissimilarity between two chromosomes can be
important during the evolution process. The distance between two rough chromosomes
is given as follows:

$$distance(r, s) = \sum_{1 \le i \le n} \sqrt{(\underline{r_i} - \underline{s_i})^2 + (\overline{r_i} - \overline{s_i})^2}. \qquad (10)$$

The distance between rough chromosomes $r$ and $s$ from Fig. 2 can be calculated
as:

$$distance(r, s) = \sqrt{(0.2 - 0.3)^2 + (0.4 - 0.5)^2} + \sqrt{(0.1 - 0.3)^2 + (0.6 - 0.4)^2} + \sqrt{(0.2 - 0.5)^2 + (0.3 - 0.8)^2} + \sqrt{(0.7 - 0.6)^2 + (0.9 - 0.7)^2} = 1.23.$$
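
The calculation above can be reproduced with a few lines of Python. The sketch assumes chromosomes are lists of (lower, upper) pairs; the bounds of `r` and `s` are read off the terms of the worked calculation.

```python
import math

def distance(r, s):
    """Eq. (10): sum over genes of the Euclidean distance between bound pairs."""
    return sum(math.hypot(rl - sl, ru - su) for (rl, ru), (sl, su) in zip(r, s))

r = [(0.2, 0.4), (0.1, 0.6), (0.2, 0.3), (0.7, 0.9)]
s = [(0.3, 0.5), (0.3, 0.4), (0.5, 0.8), (0.6, 0.7)]

print(round(distance(r, s), 2))
```

Each gene contributes its own square root, so the measure is a sum of per-gene distances rather than a single Euclidean norm over all bounds.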


4 An Application of Rough Genetic Algorithms

Information retrieval is an important issue in the modern information age. A
huge amount of information is now available to the general public through the
advent of the Internet and other related technologies. Previously, the information
was made available through experts such as librarians, who helped the general
public analyze their information needs. Because of enhanced communication
facilities, the general public can access various documents directly from their
desktop computers without having to consult a human expert. Modern information
retrieval systems must assist the general public in locating documents
relevant to their needs.

In  the  traditional  approach,  user  queries  are  usually  represented  in  a  linear 
form  obtained  from  the  user.  However,  the  user  may  not  be  able  to  specify  his 
information  needs  in  the  mathematical  form,  either  because  he  is  not  comfortable 
with  the  mathematical  form,  or  the  mathematical  form  does  not  provide  a  good 
representation  of  his  information  needs  [8].  Wong  and  Yao  [13]  proposed  the  use 
of  perceptrons  to  learn  the  user  query  based  on  document  preference  specified 
by  the  user  for  a  sample  set.  Lingras  [8]  extended  the  approach  using  non-linear 
neural  networks.  This  section  illustrates  how  rough  genetic  algorithms  can  learn 
the  user  query  from  a  sample  of  documents. 

Let us consider a small sample of documents $a$, $b$, $c$, and $d$. Let us assume that
each document is represented using four keywords: Web Search, Information Retrieval,
Intelligent Agents, and Genetic Algorithms. Fig. 4 shows the documents
represented as conventional chromosomes. The value $a_1 = 0.6$ corresponds to the
relative importance attached to the keyword Web Search in document $a$. Similarly,
$a_2 = 0.9$ corresponds to the relative importance attached to the keyword
Information Retrieval in document $a$, etc.

As  mentioned  before,  the  user  may  not  be  able  to  specify  the  precise  query 
that  could  be  matched  with  the  document  set.  However,  given  a  sample  set,  she 
may  be  able  to  identify  relevant  and  non-relevant  documents.  Let  us  assume  that 
the  user  deemed  a  and  b  as  relevant.  The  documents  c  and  d  were  considered 
non-relevant  to  the  user.  This  information  can  be  used  to  learn  a  linear  query 
by  associating  weights  for  each  of  the  four  keywords  [13].  However,  it  may  not 




a = (0.6, 0.9, 0.4, 0.1)
b = (0.1, 0.4, 0.5, 0.9)
c = (0.9, 0.8, 0.1, 0.3)
d = (0.9, 0.2, 0.9, 0.2)

Fig. 4. Document set represented as conventional chromosomes


be  appropriate  to  associate  precise  weights  for  each  keyword.  Instead,  a  range 
of  weights  such  as  0.3-0.5,  may  be  a  more  realistic  representation  of  the  user’s 
opinion.  A  rough  query  can  then  be  represented  using  rough  chromosomes. 

The  user  may  supply  an  initial  query  and  a  genetic  algorithm  may  generate 
additional  random  queries.  The  evolution  process  given  by  Fig.  1  can  be  used 
until  the  user’s  preference  is  adequately  represented  by  a  rough  chromosome. 
Fig.  5  shows  an  objective  function  which  may  be  used  to  evaluate  the  population 
in  such  an  evolution  process. 


Objective  function: 

repeat for all the relevant documents d
    repeat for all the non-relevant documents d'
        if distance(r, d) < distance(r, d') then
            match++;

return match;


Fig. 5. An example of objective function for document retrieval


Let us assume that $r$ and $s$ in Fig. 2 are our candidate queries.
In that case,

$distance(r, a) = 2.53$,
$distance(r, b) = 1.24$,
$distance(r, c) = 2.53$, and
$distance(r, d) = 3.05$.

Similarly,

$distance(s, a) = 2.29$,
$distance(s, b) = 1.21$,
$distance(s, c) = 2.67$, and
$distance(s, d) = 2.00$.

Using the objective function given in Fig. 5, rough chromosome $r$ evaluates
to 4 and $s$ evaluates to 3. Hence, in the natural selection process, $r$ will be chosen
ahead of $s$. If both of these rough chromosomes were selected for creating the
next generation, we may apply genetic operators such as mutation, crossover,
union and intersection. The results of union and intersection of $r$ and $s$ are shown
in Fig. 2.
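
The evaluation in this example can be replayed in Python. The sketch below assumes the document weights of Fig. 4 and takes the bounds of the two candidate queries from the worked distance calculation in Section 3; the helper `as_rough`, which turns a conventional chromosome into a rough one with equal bounds, is a hypothetical name.

```python
import math

def distance(r, s):
    """Sum over genes of the Euclidean distance between (lower, upper) pairs."""
    return sum(math.hypot(rl - sl, ru - su) for (rl, ru), (sl, su) in zip(r, s))

def as_rough(document):
    """Represent a conventional chromosome with equal lower and upper bounds."""
    return [(value, value) for value in document]

def objective(query, relevant, non_relevant):
    """Count the (relevant, non-relevant) pairs in which the relevant one is closer."""
    match = 0
    for d in relevant:
        for d_prime in non_relevant:
            if distance(query, as_rough(d)) < distance(query, as_rough(d_prime)):
                match += 1
    return match

a = (0.6, 0.9, 0.4, 0.1)
b = (0.1, 0.4, 0.5, 0.9)
c = (0.9, 0.8, 0.1, 0.3)
d = (0.9, 0.2, 0.9, 0.2)

r = [(0.2, 0.4), (0.1, 0.6), (0.2, 0.3), (0.7, 0.9)]
s = [(0.3, 0.5), (0.3, 0.4), (0.5, 0.8), (0.6, 0.7)]

print(objective(r, [a, b], [c, d]), objective(s, [a, b], [c, d]))
```

With two relevant and two non-relevant documents the score ranges from 0 to 4, so it measures how many pairwise preferences of the user a query respects.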

The example used here demonstrates a few aspects of a rough genetic algorithm.
Typically, we will select twenty candidate queries for every generation.
Depending upon a probability distribution, the four different genetic operators
will be applied to create the next generation. The evolution process will go on
for several generations.

In  practice,  a  document  retrieval  process  will  involve  hundreds  of  keywords. 
Instead  of  classifying  sample  documents  as  relevant  or  non-relevant,  it  may  be 
possible  to  rank  the  documents.  Rough  genetic  algorithms  may  provide  a  suitable 
mechanism  to  optimize  the  search  for  the  user  query.  Results  of  applications  of 
RGAs  for  web  searching  will  appear  in  a  future  publication.  An  implementation 
of  rough  extensions  to  a  genetic  algorithm  library  is  also  currently  underway  and 
may  be  available  for  distribution  in  the  future. 


5  Summary  and  Conclusions 

This  paper  proposes  Rough  Genetic  Algorithms  (RGAs)  based  on  the  notion  of 
rough  values. 

A  rough  value  consists  of  an  upper  and  a  lower  bound.  Variables  such  as  daily 
temperature  are  associated  with  a  set  of  values  instead  of  a  single  value.  The 
upper  and  lower  bounds  of  the  set  can  represent  variables  using  rough  values. 

Rough equivalents of basic notions such as gene and chromosome are defined
here as part of the proposal. The paper also presents new genetic operators,
namely, union and intersection, made possible with the introduction of rough
computing. These rough genetic operators provide additional flexibility for
creating new generations during the evolution. Two new evaluation measures, called
precision and distance, are also defined. The precision function quantifies the
information contained in a rough chromosome, while the distance function is used
to calculate the dissimilarity between two rough chromosomes. A simple document
retrieval example was used to demonstrate the usefulness of RGAs. Rough
genetic algorithms seem to provide useful extensions for practical applications.
Future publications will present results of such experimentation.


Classifying Faults in High Voltage Power Systems:

Abstract: This paper introduces an approach to classifying faults in high voltage
power systems with a combination of rough sets and fuzzy sets in a neural computing
framework. Typical error signals important for fault detection in power systems are
considered. Features of these error signals, derived earlier using Fast Fourier Transform
analysis, amplitude estimation and waveform type identification, provide inputs to a
neural network used in classifying faults. A form of rough neuron with memory is
introduced in this paper. A brief overview of a rough-fuzzy neural computational
method is given. The learning performance of a rough-fuzzy neural network and that
of a pure fuzzy neural network are compared.

Keywords:  Approximation,  calibration,  classification,  faults,  fuzzy  sets,  rough 
neuron,  rough  sets,  neural  network,  high  voltage  power  system 

1  Introduction 

A file of high voltage power system faults recorded by the Transcan Recording System
(TRS) at Manitoba Hydro in the past three years provides a collection of unclassified
signals. The TRS records power system data whenever a fault occurs. However, the
TRS does not classify faults relative to waveform types. To date, a number of power
system fault signal readings have been visually associated with seven waveform types.
In this paper, a combination of rough sets and fuzzy sets is used in a neural computing
framework to classify faults. Rough neural networks (rNNs) were introduced in 1996
[1], and elaborated in [2]-[4]. This paper reports research-in-progress on classifying
power system faults and also introduces the design of neurons in rNNs in the context of
rough sets.

This  paper  is  organized  as  follows.  Waveform  types  of  power  system  faults  are 
discussed  in  Section  2.  The  basic  concepts  of  rough  sets  and  design  of  a  rough  neural 
network  are  presented  in  Section  3.  An  overview  of  a  form  of  rough-fuzzy  neural 
computation is given in Section 4. In this section, a performance comparison
between a rough-fuzzy neural network and a pure fuzzy neural network is also provided.

2 Power System Faults

Using  methods  described  in  [5],  a  group  of  26  pulse  signals  relative  to  seven  types  of 
waveforms  have  been  selected  for  this  study  (see  Table  1).  Each  value  in  Table  1 
specifies  the  degree-of-membership  of  a  pulse  signal  in  the  waveform  of  a  particular 
fault  type.  Values  greater  than  0.5  indicate  “definite”  membership  of  a  signal  in  a  fault 
class.  Values  below  0.5  indicate  uncertainty  that  a  signal  is  of  a  particular  fault  type. 
From  line  7  of  Table  1,  a  value  of  0.6734  indicates  a  high  degree  of  certainty  that  a 
pole  line  flasher  signal  has  a  type  2  waveform. 


Table  1  Sample  Power  System  Faults  Relative  to  Waveform  Type 


Fault             Degree-of-membership / Waveform Type
                  [illegible]  [illegible]  [illegible]
Value Cab         0.0724  0.0231  0.0381  0.8990  [illegible]  0.0201
AC filter test    0.0752  0.0270  0.0447  0.1102  0.0259  0.0158
                  0.1383  0.0446  0.1300  0.0506  0.0410  0.0567
500 kV close      0.0862  0.1234  0.2790  0.1224  0.2083  0.8334
pole line flash   0.0369  0.3389  0.0600  0.0251  [illegible]  0.0214
pole line flash   0.0340  0.6734  0.0573  0.0237  0.1539  0.0271  0.0201
pole line flash   0.0327  [illegible]  0.0231  0.1537  0.0263  0.0231
pole line flash   0.0337  0.4836  0.0561  0.0211  0.1767  0.0283  0.0221
pole line flash   0.0329  0.5336  0.0582  0.0241  0.0205
pole line retard  [illegible]  0.2056  0.0548  0.0230  0.0854  [illegible]  0.0156

3  Classifying  Faults 

In  this  paper,  the  classification  of  six  high  voltage  power  system  faults  relative  to 
candidate  waveforms  is  carried  out  with  a  neural  network  which  combines  the  use  of 
rough  sets  and  fuzzy  sets. 

3.1  Basic  Concepts  of  Rough  Sets 

Rough set theory offers a systematic approach to set approximation [6]-[7]. To begin,
let $S = (U, A)$ be an information system where $U$ is a non-empty finite set of objects
and $A$ is a non-empty finite set of attributes, where $a : U \to V_a$ for every $a \in A$. For
each $B \subseteq A$, there is associated an equivalence relation $Ind_A(B)$ such that

$$Ind_A(B) = \{(x, x') \in U^2 \mid \forall a \in B.\ a(x) = a(x')\}$$

If $(x, x') \in Ind_A(B)$, we say that objects $x$ and $x'$ are indiscernible from each other
relative to attributes from $B$. The notation $[x]_B$ denotes equivalence classes of $Ind_A(B)$.
For $X \subseteq U$, the set $X$ can be approximated only from information contained in $B$ by
constructing a B-lower and a B-upper approximation, denoted by $\underline{B}X$ and $\overline{B}X$ respectively,
where $\underline{B}X = \{x \mid [x]_B \subseteq X\}$ and $\overline{B}X = \{x \mid [x]_B \cap X \neq \emptyset\}$. The objects of $\underline{B}X$
can be classified as members of $X$ with certainty, while the objects of $\overline{B}X$ can only be
classified as possible members of $X$. Let $BN_B(X) = \overline{B}X - \underline{B}X$. A set $X$ is rough if
$BN_B(X)$ is not empty. The notation $\alpha_B(X)$ denotes the accuracy of an approximation,
where

$$\alpha_B(X) = \frac{|\underline{B}X|}{|\overline{B}X|}$$

where $|X|$ denotes the cardinality of the non-empty set $X$, and $\alpha_B(X) \in [0, 1]$. The
approximation of $X$ with respect to $B$ is precise if $\alpha_B(X) = 1$. Otherwise, the
approximation of $X$ is rough with respect to $B$, i.e. $\alpha_B(X) < 1$.
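
The definitions above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the attribute names (`band`, `amp`), the five objects and the target set are hypothetical and are not taken from the paper's data.

```python
from collections import defaultdict


def approximations(universe, attributes, target):
    """Return the B-lower and B-upper approximations of `target`.

    universe:   dict mapping each object to a dict {attribute: value}
    attributes: the attribute subset B
    target:     the set X of objects to approximate
    """
    attributes = list(attributes)
    classes = defaultdict(set)  # equivalence classes of the indiscernibility relation
    for obj, row in universe.items():
        classes[tuple(row[a] for a in attributes)].add(obj)

    lower, upper = set(), set()
    for block in classes.values():
        if block <= target:      # class entirely inside X
            lower |= block
        if block & target:       # class meets X
            upper |= block
    return lower, upper


def accuracy(lower, upper):
    """Accuracy of approximation; assumes X (hence the upper set) is non-empty."""
    return len(lower) / len(upper)


# Hypothetical information system with two condition attributes.
table = {
    "x1": {"band": "high", "amp": "large"},
    "x2": {"band": "high", "amp": "large"},
    "x3": {"band": "low", "amp": "small"},
    "x4": {"band": "low", "amp": "large"},
    "x5": {"band": "low", "amp": "small"},
}
lower, upper = approximations(table, ["band", "amp"], {"x1", "x2", "x3"})
print(sorted(lower), sorted(upper), accuracy(lower, upper))
```

In this toy table x3 and x5 are indiscernible, so x3 cannot be placed in the target set with certainty; both objects fall into the boundary region and the accuracy drops below 1.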

3.2  Example 

Let PLF denote a pole line fault in a high voltage power system. The set
P = {x | PLF2(x) = yes} consists of pole line fault readings which are judged to be
type 2 waveforms (see Table 2).


Table  2.  Sample  PLF2  Decision  Table 


      PLF_a         PLF2
x1    in [τ, 1]     yes
x2    in [0, β)     no
x3    in [τ, 1]     yes
x4    in [β, τ)     yes/no
x5    in [τ, 1]     yes

In effect, PLF2 is a decision attribute whose outcome is synthesized in terms of hidden
condition attributes. To see this, let τ and β be, respectively, the threshold used to
assess the candidacy of a fault reading in a particular type of waveform and the
numerical boundary separating the possible approximation region from the rest of the
universe. Recall that a power system fault f is considered to be a waveform of type t if
the degree-of-membership of t is greater than or equal to some threshold. Next, we
construct a sample decision table for pole line faults of type 2 (see Table 2). From
Table 2, we obtain the approximation regions $\underline{B}P$ = {0.6734, 0.5836, 0.4836, 0.5336} and
$\overline{B}P$ = {0.3389, 0.6734, 0.5836, 0.4836, 0.5336} relative to condition attributes B (see Fig.
1). The set of pole line fault readings being classified is rough, since the boundary
region $BN_B(P)$ = {0.3389} in Fig. 1 is non-empty. The accuracy of the approximation
is high, since $\alpha_B(P)$ = 4/5. We use the idea of approximation regions to design a
rough neuron.


Fig.  1  Approximating  the  Set  of  Pole  Line  Fault  Recordings 

Notice that six power system faults are represented in Table 1. The degree-of-
membership of each fault is computed relative to seven types of waveforms. This
leads to 42 different rough sets, seven rough sets for each set of fault readings.

3.3  Design  of  Rough  Neurons 

A neuron is a processing element in a neural network. Informally, a rough neuron is a
processing element designed to construct approximation regions $\underline{B}X$ and $\overline{B}X$ based on
the evaluation of its input X on the basis of knowledge in a set of condition attributes
B. Let $r_m$ be a rough neuron with memory. Let X and τ be, respectively, a set of
unclassified fault readings and the threshold used to assess the candidacy of a fault
reading in a particular type of waveform. Further, let a be a degree-of-membership
function such that $a : X \to [0, 1]$. Internally, a rough neuron $r_m$ performs the following
computation on each $x \in X$.

$$r_m(x) = \begin{cases} \underline{B}X \cup \{x\}, & \text{if } a(x) \ge \tau \\ \overline{B}X \cup \{x\}, & \text{if } \beta \le a(x) < \tau \\ U - \overline{B}X \cup \{x\}, & \text{if } 0 \le a(x) < \beta \end{cases}$$

In effect, a rough neuron constructs approximation regions over time. To make this
possible, a rough neuron is endowed with memory. During calibration, the
approximation regions from the previous epoch are recalled and updated during the
current epoch. The accuracy of the approximation computed by a rough neuron is
measured by $\alpha_B(X)$, which is the output of a rough neuron.
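
A rough neuron with memory can be sketched as a small Python class. This is an illustrative sketch rather than the authors' implementation: the class name `RoughNeuron`, the threshold values 0.45 and 0.30, and the convention of returning 0.0 while the upper approximation is still empty are all assumptions. The sketch also adds a certain member to the upper approximation, so that the lower region stays contained in the upper one, as in the example of Section 3.2.

```python
class RoughNeuron:
    """Accumulates approximation regions over successive inputs (its memory)."""

    def __init__(self, tau: float, beta: float):
        if not 0.0 <= beta <= tau <= 1.0:
            raise ValueError("expected 0 <= beta <= tau <= 1")
        self.tau = tau          # candidacy threshold
        self.beta = beta        # boundary of the possible region
        self.lower = set()      # certain members
        self.upper = set()      # possible members (includes the certain ones)
        self.outside = set()    # readings outside the upper approximation

    def present(self, x, membership: float) -> float:
        """Place reading x by its degree of membership; return the accuracy."""
        if membership >= self.tau:
            self.lower.add(x)
            self.upper.add(x)
        elif membership >= self.beta:
            self.upper.add(x)
        else:
            self.outside.add(x)
        return self.accuracy()

    def accuracy(self) -> float:
        # Assumed convention: 0.0 until at least one possible member is seen.
        return len(self.lower) / len(self.upper) if self.upper else 0.0


# Degrees of membership quoted in Section 3.2; the thresholds are hypothetical.
readings = [0.3389, 0.6734, 0.5836, 0.4836, 0.5336]
neuron = RoughNeuron(tau=0.45, beta=0.30)
for index, degree in enumerate(readings):
    output = neuron.present(index, degree)
print(output)  # 0.8
```

With these assumed thresholds the neuron reproduces the accuracy 4/5 obtained in the example, with the reading 0.3389 left in the boundary region.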

4  Rough-Fuzzy  Neural  Network  Computation 

The  basic  features  of  a  rough-fuzzy  neural  computing  algorithm  used  to  classify  the 
waveforms  of  power  system  faults  are  shown  in  the  flowchart  in  Fig.  2. 


[Flowchart: selected inputs (e.g. main bandwidth, subband amplitude of pulse signal);
compute degree-of-membership of pulse signal in fault type i waveform]

Fig. 2 Rough-Fuzzy Neural Computation


The details of the underlying network have been omitted due to space constraints. The
computation in Fig. 2 begins with the initialization of modulator r and strengths-of-
connection w, u. During calibration, r, w, u will be adjusted until the error Q is less
than some threshold δ.

The degree-of-membership function used in Fig. 2 takes values in [0, 1]. The output
of each rough neuron is aggregated with the degree-of-membership values using
t-norm, s-norm and implication (→) operations from fuzzy set theory to compute $z_j$.
In the next stage of the neural computation in Fig. 2, the weighted z-values are
aggregated with s-norm and t-norm operations to compute y.

4.1  Calibration 

The  flowchart  in  Fig.  2  has  a  feedback  loop  used  to  calibrate  r,  w,  and  u  relative  to 
target  values  of  the  estimated  type-of-fault.  The  calibration  scheme  for  rough-fuzzy 
neural  networks  exploits  the  learning  method  given  in  [8].  What  follows  is  a  brief 
summary  of  the  calibration  steps: 

1. Initialize modulator r and strengths-of-connection w and u.

2. Introduce a training set representing data sets of values containing information
about fault signals.

3. Compute y of the output neuron.

4. Compute the error index Q by comparing network outputs with target values
using Q = target - y.

5. Let α > 0 denote the positive learning rate. Based on the values of the error
indices, adjust the r, w, and u parameters using the usual gradient-based
optimization method suggested in (1) and (2),

$$param(new) = param - \alpha \frac{\partial Q}{\partial param} \qquad (1)$$

$$\frac{\partial Q}{\partial param} = \frac{\partial Q}{\partial y} \cdot \frac{\partial y}{\partial param} \qquad (2)$$
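
The update rule in (1) and (2) can be illustrated with a short Python sketch. Several assumptions are made here and should be read as such: the network is replaced by a hypothetical stand-in model y = w·x + u, the modulator r is omitted, the error is taken as the squared difference (target - y)² so that it has a minimum to descend towards, and the training pairs are invented.

```python
def calibrate(samples, w=0.0, u=0.0, alpha=0.1, delta=1e-6, max_epochs=10_000):
    """Adjust w and u by gradient descent until the summed error falls below delta.

    samples: list of (x, target) pairs
    Stand-in model (an assumption, not the paper's network): y = w * x + u
    """
    total_q = float("inf")
    for _ in range(max_epochs):
        total_q = 0.0
        for x, target in samples:
            y = w * x + u
            q = (target - y) ** 2            # assumed squared-error index
            dq_dy = -2.0 * (target - y)      # dQ/dy
            # Chain rule (2): dQ/dparam = dQ/dy * dy/dparam, then update (1).
            w -= alpha * dq_dy * x           # dy/dw = x
            u -= alpha * dq_dy * 1.0         # dy/du = 1
            total_q += q
        if total_q < delta:
            break
    return w, u, total_q


# Invented training pairs lying on the line target = 0.6 * x + 0.2.
training = [(0.0, 0.2), (0.5, 0.5), (1.0, 0.8)]
w_fit, u_fit, final_q = calibrate(training)
print(round(w_fit, 2), round(u_fit, 2))
```

The same loop structure applies to any differentiable parameter: only the factor dy/dparam changes from one parameter to the next.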


4.2  Learning  Performance  of  Two  Types  of  Networks 

A rough-fuzzy neural network and a pure fuzzy neural network have been calibrated
and compared (see Figures 3, 4, and 5).

Fig. 3 Rough-fuzzy network performance
Fig. 4 Performance of fuzzy network

Plots comparing the learning performance of these networks during calibration are
given in Figures 3 and 4. For the same learning iteration, the performance of the
rough-fuzzy neural network is better than that of the pure fuzzy neural network.
After the calibration of both neural networks, all of the connections relative
to the r, w and u parameters have been determined. To test the performance of the
sample rough-fuzzy and pure fuzzy neural networks, we utilize an additional 26 data sets
of fault signals.

Notice in Fig. 5 that the estimation of the fault type by the rough-fuzzy neural network
is more precise than that of the pure fuzzy neural network.

5  Concluding  Remarks 

The  design  of  a  rough  neuron  in  the  context  of  rough  sets  has  been  given.  The  output 
of  a  rough  neuron  is  an  accuracy  of  approximation  measurement,  which  is  granulated 
and  used  in  conjunction  with  aggregation  methods  from  fuzzy  sets  to  classify  the  type 
of  waveform  of  detected  high  voltage  power  system  faults.  This  work  is  part  of  a  study 
begun at Manitoba Hydro in 1998.


Abstract. Rough mereology is a paradigm that blends the main ideas of two
potent paradigms for approximate reasoning: fuzzy set theory and rough set
theory. Essential ideas of rough mereology and schemes for approximate
reasoning in distributed systems based on rough mereological logic were presented in
[13], [14], [17]. Spatial reasoning is an extensively studied paradigm stretching
from theoretical investigations of proper languages and models for this
reasoning to application studies concerned with e.g. geographic databases, satellite
image analyses, geodesy applications etc. We propose a rough mereological
environment for spatial reasoning under uncertainty. We compare our context with
an alternative mereological context defined within the Calculus of
Individuals [10] by Clarke [5] and developed into schemes for spatial reasoning in [2],
[3], where the reader will find examples of linguistic interpretation. We outline
how to define in the rough mereological domain the topological and
geometrical structures which are fundamental for spatial reasoning; we show that rough
mereology allows for introducing notions studied earlier in other mereological
theories [2], [3], [5]. This note sums up a first step toward our synthesis of
intelligent control algorithms useful in mobile robotics [1], [7], [8].

Keywords  rough  mereology,  mereotopology,  spatial  reasoning,  connection, 
rough  mereological  geometry 

1  Introduction 

Rough mereology has been proposed in [13] and developed into a paradigm for
approximate reasoning in [14]. Its applications to problems of approximate
synthesis, control, design and analysis of complex objects have been discussed in
[17], and in [15] a granular semantics for computing with words was proposed
based on rough mereology. We are concerned here with the issues of spatial
reasoning under uncertainty. Therefore we study the rough mereological paradigm
in a geometric-mereotopological setting (cf. [2], [3]). Spatial reasoning plays
an important role in intelligent robot control (cf. [1], [7], [8]) and we are aiming
at synthesizing a context for control under uncertainty of a mobile robot which
may possibly involve natural language interfaces. Rough Mereology is a
natural extension of Mereology (cf. [11], [18]) and we also give a brief sketch of
relevant theories of Ontology and Mereology to set a proper language for our
discussion.

2  Ontology 

Ontological theory of Lesniewski [9], [18] is concerned with the explanation of
the meaning of phrases like "X is Y". Naive set theory solves this problem via
the notion of an element; in Ontology, the esti symbol $\in$ is replaced by the
copula $\varepsilon$ (read "is"). Ontology makes use of functors of either of two categories:
propositional and nominal; the former yield propositions, the latter new names.
We begin this very concise outline of Ontology by selecting symbols X, Y, Z, ...
to denote names (of objects); the primitive symbol of ontology is $\varepsilon$ (read "is").
The sole Axiom of Ontology is a formula coding the meaning of $\varepsilon$ as follows.

2.1 Ontology Axiom

$$X \varepsilon Y \Leftrightarrow \exists Z.\ Z \varepsilon X \wedge \forall U, W.(U \varepsilon X \wedge W \varepsilon X \Rightarrow U \varepsilon W) \wedge \forall T.(T \varepsilon X \Rightarrow T \varepsilon Y)$$

This axiom determines the meaning of the formula $X \varepsilon Y$ ("X is Y") as the
conjunction of three conditions: $\exists Z.\ Z \varepsilon X$ ("something is X");
$\forall U, W.(U \varepsilon X \wedge W \varepsilon X \Rightarrow U \varepsilon W)$ ("any two objects which are X are identical",
i.e. X is an individual name); $\forall T.(T \varepsilon X \Rightarrow T \varepsilon Y)$ ("everything which is X is Y").

Therefore the meaning of the formula $X \varepsilon Y$ is as follows: X is a non-empty
name of an individual (X is an individual) and any object which is X is also Y.

We introduce a name V defined via $X \varepsilon V \Leftrightarrow \exists Y.\ X \varepsilon Y$, being a name
for a universal object. The copula $\varepsilon$ formalized as above permits us to accommodate
distributive classes (counterparts of sets in naive set theory). The next step is
to formalize the notion of collective classes (counterparts of unions of families
of sets). This belongs to Mereology.

3  Mereology 

Mereology of Lesniewski [11], [19] can be based on any of a few primitive notions
related one to another: part, element, class, ...; here, we begin with the notion of
a part conceived as a name-forming functor pt on individual names.

3.1 Mereology Axioms

We start with basic axioms for pt.

(ME1) $X \varepsilon\, pt(Y) \Rightarrow \exists Z.\ Z \varepsilon X \wedge X \varepsilon V \wedge Y \varepsilon V$;

(ME2) $X \varepsilon\, pt(Y) \wedge Y \varepsilon\, pt(Z) \Rightarrow X \varepsilon\, pt(Z)$;

(ME3) $non(X \varepsilon\, pt(X))$.

Then $X \varepsilon\, pt(Y)$ means that the individual denoted X is a proper part (in
virtue of (ME3)) of the individual denoted Y. The concept of an improper part
is reflected in the notion of an element el; this is a name-forming functor
defined as follows:

$$X \varepsilon\, el(Y) \Leftrightarrow X \varepsilon\, pt(Y) \vee X = Y.$$

We will require that the following inference rule be valid.

(ME4) $\forall T.(T \varepsilon\, el(X) \Rightarrow \exists W.\ W \varepsilon\, el(T) \wedge W \varepsilon\, el(Y)) \Rightarrow X \varepsilon\, el(Y)$.
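
One familiar model of these axioms, used here purely as an illustration and not taken from the paper, consists of the non-empty subsets of a finite set of atoms, with pt read as proper inclusion and el as inclusion. The following Python sketch checks (ME2), (ME3) and (ME4) exhaustively in such a model; the atom names and function names are hypothetical.

```python
from itertools import combinations


def nonempty_subsets(atoms):
    """All non-empty subsets of `atoms`, as frozensets, smallest first."""
    items = sorted(atoms)
    return [frozenset(c)
            for n in range(1, len(items) + 1)
            for c in combinations(items, n)]


def pt(x, y):
    """x is a proper part of y (proper inclusion in this model)."""
    return x < y


def el(x, y):
    """x is an element (improper part) of y (inclusion in this model)."""
    return x <= y


objects = nonempty_subsets({"a", "b", "c"})

# (ME2): pt is transitive.
me2 = all(pt(x, z)
          for x in objects for y in objects for z in objects
          if pt(x, y) and pt(y, z))

# (ME3): nothing is a proper part of itself.
me3 = all(not pt(x, x) for x in objects)


def me4_premise(x, y):
    """Every element T of x shares some element W with y."""
    return all(any(el(w, t) and el(w, y) for w in objects)
               for t in objects if el(t, x))


# (ME4): the premise forces x to be an element of y.
me4 = all(el(x, y) for x in objects for y in objects if me4_premise(x, y))

print(me2, me3, me4)  # True True True
```

The empty set is excluded from the universe on purpose: with it, every two objects would share an element and the premise of (ME4) would hold trivially.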

3.2  Classes 

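The notion of a collective class may be introduced at this point; this is effected
by means of a name-forming functor Kl defined as follows.

$$X \varepsilon\, Kl(Y) \Leftrightarrow \exists Z.\ Z \varepsilon Y \wedge \forall Z.(Z \varepsilon Y \Rightarrow Z \varepsilon\, el(X)) \wedge \forall Z.(Z \varepsilon\, el(X) \Rightarrow \exists U, W.\ U \varepsilon Y \wedge W \varepsilon\, el(U) \wedge W \varepsilon\, el(Z)).$$

The notion of a class is subjected to the following restrictions:

(ME5) $X \varepsilon\, Kl(Y) \wedge Z \varepsilon\, Kl(Y) \Rightarrow Z \varepsilon X$ (Kl(Y) is an individual);

(ME6) $\exists Z.\ Z \varepsilon Y \Rightarrow \exists Z.\ Z \varepsilon\, Kl(Y)$ (the class exists for each non-empty name).

Thus, Kl(Y) is defined for any non-empty name Y and Kl(Y) is an individual
object. One can also introduce a less restrictive name, viz. that of a set:

$$X \varepsilon\, set(Y) \Leftrightarrow \exists Z.\ Z \varepsilon Y \wedge \forall Z.(Z \varepsilon\, el(X) \Rightarrow \exists U, W.\ U \varepsilon Y \wedge W \varepsilon\, el(U) \wedge W \varepsilon\, el(Z)).$$

Thus, a set is like a class except for the universality property
$\forall Z.(Z \varepsilon Y \Rightarrow Z \varepsilon\, el(X))$.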
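
The difference between a class and a set can be seen in a small finite model. As an assumption made only for illustration, individuals are the non-empty subsets of a three-atom universe, el is inclusion, and a name Y is modelled as a Python collection of individuals; the helper names below are hypothetical.

```python
from itertools import combinations


def nonempty_subsets(atoms):
    """All non-empty subsets of `atoms`, as frozensets, smallest first."""
    items = sorted(atoms)
    return [frozenset(c)
            for n in range(1, len(items) + 1)
            for c in combinations(items, n)]


def overlaps_name(z, name, objects):
    """Some W is an element of both z and some member U of the name."""
    return any(w <= u and w <= z for u in name for w in objects)


def is_set(x, name, objects):
    """x is a set of the name: the name is non-empty and every element of x
    overlaps a member of the name."""
    return bool(name) and all(overlaps_name(z, name, objects)
                              for z in objects if z <= x)


def is_class(x, name, objects):
    """x is the class of the name: a set that also contains every member."""
    return is_set(x, name, objects) and all(z <= x for z in name)


objects = nonempty_subsets({"a", "b", "c"})
name = [frozenset({"a"}), frozenset({"b"})]   # a name with two individuals

classes = [x for x in objects if is_class(x, name, objects)]
sets = [x for x in objects if is_set(x, name, objects)]
print([sorted(x) for x in classes])  # [['a', 'b']]
print([sorted(x) for x in sets])     # [['a'], ['b'], ['a', 'b']]
```

In this model the class of a name is unique and equals the union of its members, in line with (ME5) and (ME6), whereas several individuals qualify as sets of the same name.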

3.3  Mereotopology:  first  notions 

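Within mereology one may define (cf. [11]) some functors expressing relative
position of objects. The functor ext expresses disjointness in terms of parts:

$$X \varepsilon\, ext(Y) \Leftrightarrow non(\exists Z.\ Z \varepsilon\, el(X) \wedge Z \varepsilon\, el(Y)).$$

The notion of a complement is expressed by the functor comp:

$$X \varepsilon\, comp(Y, rel\ Z) \Leftrightarrow Y \varepsilon\, sub(Z) \wedge X \varepsilon\, Kl(elZ \backslash extY)$$

where $U \varepsilon\, elZ \backslash extY$ iff $U \varepsilon\, el(Z) \wedge U \varepsilon\, ext(Y)$.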

4  Rough  Mereology 

Approximate Reasoning carried out under Uncertainty needs a weaker form of
part predicate: that of being a part in a degree. The degree of being a part may then
be specified either on the basis of a priori considerations and findings or directly
from data [14]. In our construction of a rough mereological predicate, we are guided
by the tendency to preserve Mereology as an exact skeleton of reasoning.

Rough Mereology has been proposed and studied in [13], [14], [17] as a
first-order theory. Here, we propose a formalization in the framework of Ontology;
hence, rough mereology now becomes a genuine extension of mereology in a
unified framework. By virtue of our earlier studies cited above, we may now
assume that rough mereology is defined around a certain mereological theory as
its extension. We therefore assume that a mereological predicate el of an element
is given and $\varepsilon$ is a symbol for the ontological copula as defined above.

4.1 Rough Mereology Axioms

The following is a list of axiomatic postulates for Rough Mereology. We introduce
a graded family $\mu_r$, where $r \in [0, 1]$ is a real number from the unit interval, of
name-forming functors of an individual name which would satisfy

(RM1) $X \varepsilon \mu_1(Y) \Leftrightarrow X \varepsilon\, el(Y)$ (a part in degree 1 is an element);

(RM2) $X \varepsilon \mu_1(Y) \Rightarrow \forall Z.(Z \varepsilon \mu_r(X) \Rightarrow Z \varepsilon \mu_r(Y))$ (monotonicity);

(RM3) $X = Y \wedge X \varepsilon \mu_r(Z) \Rightarrow Y \varepsilon \mu_r(Z)$ (identity of objects);

(RM4) $X \varepsilon \mu_r(Y) \wedge s < r \Rightarrow X \varepsilon \mu_s(Y)$ (meaning of $\mu_r$: a part in degree at
least r);

we introduce the following notational convention:

$$X \varepsilon \mu_r^+(Y) \Leftrightarrow X \varepsilon \mu_r(Y) \wedge non(\exists s > r.\ X \varepsilon \mu_s(Y)).$$

In some versions of our approach, we adopt one more axiom

(RM5) $X \varepsilon\, ext(Y) \Rightarrow X \varepsilon \mu_0^+(Y)$ (disjointness of objects is fully recognizable)

or its weakened form expressing uncertainty of our reasoning

(RM5)* $X \varepsilon\, ext(Y) \Rightarrow \exists r < 1.\ X \varepsilon \mu_r^+(Y)$ (disjointness is recognizable up to
a bounded uncertainty).


4.2  Models 

One may have as an archetypical rough mereological predicate the rough
membership function of Pawlak and Skowron [12] defined in an extended form as:

$$X \varepsilon \mu_r(Y) \Leftrightarrow \frac{card(X \cap Y)}{card(X)} \ge r$$

where X, Y are (either exact or rough) subsets in the universe U of an
information/decision system (U, A).
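
The rough membership model can be sketched directly on finite sets. The function names and the example sets below are hypothetical; exact fractions are used so that comparisons with the degree r are not affected by rounding.

```python
from fractions import Fraction


def inclusion_degree(x: set, y: set) -> Fraction:
    """card(X ∩ Y) / card(X) for a non-empty set X."""
    if not x:
        raise ValueError("X must be non-empty")
    return Fraction(len(x & y), len(x))


def mu(r, x: set, y: set) -> bool:
    """True when X is a part of Y in degree at least r."""
    return inclusion_degree(x, y) >= r


X = {1, 2, 3, 4}
Y = {3, 4, 5}
print(inclusion_degree(X, Y))      # 1/2
print(mu(Fraction(1, 2), X, Y))    # True
print(mu(Fraction(3, 4), X, Y))    # False
print(mu(1, {3, 4}, Y))            # True: degree 1 coincides with inclusion
```

Lowering r can only turn a False answer into True, which is the content of (RM4), and degree 1 holds exactly when X is included in Y, as in (RM1).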


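4.3 Mereotopology: Čech Topologies

Topological structures are important for spatial reasoning: setting the interior
and the boundary of an object apart allows for expressing various spatial
relations of contact (cf. e.g. [2], [3]). We point out here that (weak) topologies are
immanent to rough mereological structures.

We define an object $Kl_r X$, for each X and r < 1, as follows:

$Z \varepsilon\, Kl_r X \Leftrightarrow Z \varepsilon\, Kl(\mu_r X)$ where $Z \varepsilon \mu_r X \Leftrightarrow Z \varepsilon \mu_r(X)$.

Thus $Kl_r X$ is the class of all objects Z such that $Z \varepsilon \mu_r(X)$.

A simplified description of $Kl_r X$ may be provided as follows.

Let $B_r X$ be defined via: $Z \varepsilon B_r X \Leftrightarrow \exists T.\ Z \varepsilon\, el(T) \wedge T \varepsilon \mu_r X$.

Then we have

Proposition 1. $Kl_r X = B_r X$.

Proof. Let $Z \varepsilon\, el(B_r X)$; there is T such that $Z \varepsilon\, el(T)$ and $T \varepsilon \mu_r X$. Hence the
following is true: $\forall Z.(Z \varepsilon\, el(B_r X) \Rightarrow \exists U.\ U \varepsilon\, el(Z) \wedge U \varepsilon\, el(Kl_r X))$, and
$B_r X \varepsilon\, el(Kl_r X)$ follows by (ME4). Similarly, for $Z \varepsilon\, el(Kl_r X)$, we have P, Q with
$P \varepsilon\, el(Z)$, $P \varepsilon\, el(Q)$, $Q \varepsilon \mu_r(X)$. Hence $P \varepsilon\, el(B_r X)$ and (ME4) implies that
$Kl_r X \varepsilon\, el(B_r X)$, so finally $Kl_r X = B_r X$.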

There is another property, showing the monotonicity of class operators.

Proposition 2. For s < r, $Kl_r X \varepsilon\, el(Kl_s X)$.

Indeed, by the previous fact, $Z \varepsilon\, el(Kl_r X)$ implies that $Z \varepsilon\, el(T)$ and $T \varepsilon \mu_r X$
for some T, hence $T \varepsilon \mu_s X$ and a fortiori $Z \varepsilon\, el(Kl_s X)$.

Introducing a constant name $\Lambda$ (the empty name) via the definition:

$$X \varepsilon \Lambda \Leftrightarrow X \varepsilon X \wedge non(X \varepsilon X)$$

and defining the interior Int X of an object X as follows:

$$Int X \varepsilon\, Kl(int\,X)$$

where $Z \varepsilon\, int\,X \Leftrightarrow \exists T.\ \exists r < 1.\ Z \varepsilon\, el(Kl_r T) \wedge Kl_r T \varepsilon\, el(X)$,
i.e. Int X is the class of objects of the form $Kl_r T$ which are elements of X,
we have

Proposition 3. (i) $Int \Lambda \varepsilon\, Int \Lambda \Leftrightarrow \Lambda \varepsilon \Lambda$ (the interior of the empty concept is
the empty concept);

(ii) $X \varepsilon\, el(Y) \Rightarrow Int X \varepsilon\, el(Int Y)$ (monotonicity of Int);

(iii) $Int\, KlV \varepsilon\, KlV$ (the universe is open).

Properties (i)-(iii) witness that the family of all classes $Kl_r T$, r < 1, is a base
for a Čech topology [21]; we call this topology the rough mereological topology
(rm-topology).

5 From Čech mereotopologies to mereotopologies

We go a step further: we make rm-topology into a topology (i.e. open sets have
open intersections); this comes at a cost: we need a specific model for rough
mereology.


5.1  A  t-norm  model 

We recall that a t-norm is a 2-argument functor $T(x, y) : [0, 1]^2 \to [0, 1]$
satisfying the conditions:

(i) $T(x, y) = T(y, x)$; (ii) $T(x, 1) = x$; (iii) $x' \ge x,\ y' \ge y \Rightarrow T(x', y') \ge T(x, y)$;
(iv) $T(x, T(y, z)) = T(T(x, y), z)$

and that the residual implication induced by T, in symbols $\overrightarrow{T}$, is defined via

$$\overrightarrow{T}(r, s) \ge t \Leftrightarrow T(t, r) \le s.$$
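
The defining equivalence of the residual implication can be checked numerically for some well-known t-norms. The three t-norms below (Łukasiewicz, product and minimum) and their residua are standard textbook examples chosen for illustration and are not singled out by the paper; the check runs over a finite grid of exact fractions.

```python
from fractions import Fraction

ONE, ZERO = Fraction(1), Fraction(0)


def t_lukasiewicz(x, y):
    return max(ZERO, x + y - 1)


def r_lukasiewicz(r, s):
    return min(ONE, 1 - r + s)


def t_product(x, y):
    return x * y


def r_product(r, s):
    return ONE if r <= s else s / r


def t_min(x, y):
    return min(x, y)


def r_min(r, s):
    return ONE if r <= s else s


grid = [Fraction(k, 10) for k in range(11)]


def adjoint(t_norm, residuum) -> bool:
    """Check residuum(r, s) >= t  <=>  t_norm(t, r) <= s on the whole grid."""
    return all((residuum(r, s) >= t) == (t_norm(t, r) <= s)
               for r in grid for s in grid for t in grid)


for name, t_norm, residuum in [("Lukasiewicz", t_lukasiewicz, r_lukasiewicz),
                               ("product", t_product, r_product),
                               ("minimum", t_min, r_min)]:
    print(name, adjoint(t_norm, residuum))
```

In each case the residuum equals 1 exactly when r ≤ s, which is the property used when checking (RM1) for the t-norm model.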

We apply here the ideas developed in [14] and we define, given a part in
degree predicate $\mu$, a new measure of partiality in degree, defined as follows

(*) $X \varepsilon \mu_T(r)(Y) \Leftrightarrow \forall Z.(Z \varepsilon \mu(u)(X) \wedge Z \varepsilon \mu(v)(Y) \Rightarrow \overrightarrow{T}(u, v) \ge r)$.

It turns out that

Proposition 4. The functor $\mu_T$ satisfies axioms (RM1)-(RM5), (RM5)*.

Proof. We may check (RM1): $X \varepsilon \mu_T(1)(Y)$ implies that from $Z \varepsilon \mu(u)(X) \wedge
Z \varepsilon \mu(v)(Y)$ it follows that $u \le v$ for each Z, hence $Z \varepsilon\, el(X) \Rightarrow Z \varepsilon\, el(Y)$
follows for any Z, i.e. $X \varepsilon\, el(Y)$. Similarly, $X \varepsilon\, el(Y)$ implies via (RM2) for $\mu$
that $Z \varepsilon \mu(u)(X) \wedge Z \varepsilon \mu(v)(Y)$ yields $u \le v$, i.e. $\overrightarrow{T}(u, v) \ge 1$ for any Z, thus
$X \varepsilon \mu_T(1)(Y)$. (RM2), (RM3), (RM4) are checked similarly; for (RM5), we begin
with the premise $X \varepsilon\, ext(Y)$, hence $X \varepsilon \mu_0^+(Y)$; assuming $X \varepsilon \mu_T(r)(Y)$ we get by
(*) for Z = X that $\overrightarrow{T}(1, 0) \ge r$, i.e. $T(r, 1) = r \le 0$. A similar argument handles
(RM5)*.

Thus $\mu_T$ is a partiality in degree predicate.

Modifying a proof given in ([9], Prop. 14), we find that the following deduction
rule is valid for $\mu_T$:

$$(MPR)\quad \frac{X \varepsilon \mu_T(r)(Y),\quad Y \varepsilon \mu_T(s)(Z)}{X \varepsilon \mu_T(T(r, s))(Z)}$$

We denote with the symbol $Kl_{r,T} X$ the class $Kl_r X$ with respect to $\mu_T$.
We may give a new characterization of $Kl_{r,T} X$.

Proposition 5. $Y \varepsilon\, el(Kl_{r,T} X) \Leftrightarrow Y \varepsilon \mu_T(r)(X)$.

Indeed, $Y \varepsilon\, el(Kl_{r,T} X)$ means that $Y \varepsilon\, el(Z)$ and $Z \varepsilon \mu_T(r)(X)$ for some Z. From
$Y \varepsilon \mu_T(1)(Z)$ and $Z \varepsilon \mu_T(r)(X)$ it follows by (MPR) that $Y \varepsilon \mu_T(T(1, r) = r)(X)$.

We may therefore regard $Kl_{r,T} X$ as a "ball of radius r centered at X" with
respect to the "metric" $\mu_T$.

Furthermore, we have by the same argument

Proposition 6. $Y \varepsilon\, el(Kl_{r,T} X)$ and $s_0 = \min \arg(T(r, s) \ge r)$ imply
$Kl_{s_0,T} Y \varepsilon\, el(Kl_{r,T} X)$.


It follows that the family $\{Kl_{r,T} X : r < 1, X\}$ induces a topology on our
universe of objects (under the assumption that $T(r, s) < 1$ whenever $rs < 1$).
This allows us to define a variety of functors like Tangential Part, Non-tangential
Part etc., instrumental in spatial reasoning (cf. [2], [3]).


6  Connections 

We refer to an alternative scheme for mereological reasoning based on Clarke's
formalism of connection C [5] in the Calculus of Individuals of Leonard & Goodman
[10]; see in this respect [3]. This formalism is a basis for some schemes of
approximate spatial reasoning (e.g. various relations of external contact, touching
etc. may be expressed via C) (op. cit.). The basic primitive in this approach is
the predicate C(X, Y) (read "X and Y are connected") which should satisfy: (i)
$C(X, X)$; (ii) $C(X, Y) \Rightarrow C(Y, X)$; (iii) $\forall Z.(C(X, Z) \Leftrightarrow C(Y, Z)) \Rightarrow X = Y$.
From C other predicates (as mentioned above) are generated and under
additional assumptions (cf. [5]) a topology may be generated from C.

We will define a notion of connection in our model; clearly, as in our model
topological structures arise in a natural way via "metrics" $\mu$, we may afford a
more stratified approach to connection and separation properties. So we propose
a notion of a graded connection C(r, s).

6.1 From graded connections to connections

We let

$Bd_r X \varepsilon\, Kl(\mu_r^+ X)$ where $Z \varepsilon \mu_r^+ X \Leftrightarrow Z \varepsilon \mu_r(X) \wedge non(Z \varepsilon \mu_s(X),\ s > r)$
and then

$$X \varepsilon\, C(r, s)(Y) \Leftrightarrow \exists W.\ W \varepsilon\, el(Bd_r X) \wedge W \varepsilon\, el(Bd_s Y).$$

We have then clearly:

(i) $X \varepsilon\, C(1, 1)(X)$; (ii) $X \varepsilon\, C(r, s)(Y) \Rightarrow Y \varepsilon\, C(s, r)(X)$.

Concerning the property (iii), we may have some partial results. However, we
adopt here a new approach. It is realistic from both theoretical and applicational
points of view to assume that we may have "infinitesimal" parts, i.e. objects as
"small" with respect to $\mu$ as desired.


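Infinitesimal parts model We adopt a new axiom of infinitesimal parts

(IP) $non(X \varepsilon\, el(Y)) \Rightarrow \forall r > 0\ \exists Z \varepsilon\, el(X),\ s < r.\ Z \varepsilon \mu_s^+(Y)$.

Our rendering of the property (iii) under (IP) is as follows:

$non(X \varepsilon\, el(Y)) \Rightarrow \forall r > 0.\ \exists Z,\ s < r.\ Z \varepsilon \mu_s^+(Y).\ Z \varepsilon\, C(1, 1)(X) \wedge Z \varepsilon\, C(1, s)(Y)$.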

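Introducing connections Our notion of a connection will depend on a
threshold, $\alpha$, set according to the needs of a context of reasoning.

Given $0 < \alpha < 1$, we let

(CON) $X \varepsilon\, C_\alpha(Y) \Leftrightarrow \exists r, s > \alpha.\ X \varepsilon\, C(r, s)(Y)$.

Then we have

(i) $X \varepsilon\, C_\alpha(X)$, each $\alpha$;

(ii) $X \varepsilon\, C_\alpha(Y) \Rightarrow Y \varepsilon\, C_\alpha(X)$;

(iii) $X \neq Y \Rightarrow \exists Z.(Z \varepsilon\, C_\alpha(X) \wedge non(Z \varepsilon\, C_\alpha(Y)) \vee Z \varepsilon\, C_\alpha(Y) \wedge non(Z \varepsilon\, C_\alpha(X)))$

i.e. the functor $C_\alpha$ has all the properties of connection in the sense of [5] and
[2], [3].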


Restoring rough mereology from connections  We show now that when
we adopt mereological notions as they are defined via connections in the Calculus of
Individuals, we do not get anything new: we come back to the rough mereology we
started from. The formula

$X \,\varepsilon\, el_C(Y) \Leftrightarrow \forall Z.(Z \,\varepsilon\, C(X) \Rightarrow Z \,\varepsilon\, C(Y))$

is the definition of the notion of an element from a connection C. We claim

Proposition 7.  $X \,\varepsilon\, el_{C_\alpha}(Y) \Leftrightarrow X \,\varepsilon\, el(Y)$.

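Clearly, $X \,\varepsilon\, el_{C_\alpha}(Y) \Rightarrow X \,\varepsilon\, el(Y)$. Assume that $X \,\varepsilon\, el(Y)$ and $Z \,\varepsilon\, C_\alpha(X)$. There
is W with $W \,\varepsilon\, \mu_r^+(Z)$ and $W \,\varepsilon\, \mu_s^+(X)$, $r, s > \alpha$. Then by (RM2), $W \,\varepsilon\, \mu_{s'}^+(Y)$ with
an $s' \geq s$ and so $Z \,\varepsilon\, C_\alpha(Y)$. It follows that $X \,\varepsilon\, el(Y) \Rightarrow X \,\varepsilon\, el_{C_\alpha}(Y)$.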

Any of the connections $C_\alpha$ thus restores the original notion of an element, el.

Therefore in our setting of rough mereology, we may have as well the mereotopological
setting of [2], [3], [5].

Let us observe that in general $C_\alpha \neq OV$, where $X \,\varepsilon\, OV(Y) \Leftrightarrow \exists Z.\ Z \,\varepsilon\, el(X) \wedge
Z \,\varepsilon\, el(Y)$ is the functor of overlapping (in our context, objects may connect to each
other without necessarily having a part in common).


7  Geometry  via  rough  mereology 

It has been shown that in the mereotopological context of the Calculus of Individuals
one may introduce a geometry (cf. [3]). We show that in the context of
rough mereology geometric structures arise naturally without any resort to the
intermediate structure of connection. It is well known that elementary geometry
may be developed on the basis of e.g. the primitive notion of "being closer to ...
than to ...". We consider here the axioms for this notion going back to Tarski (cf.
e.g. [4]) and we introduce a name-forming functor on pairs of individual names
$CT(Y, Z)$ ($X \,\varepsilon\, CT(Y, Z)$ is read "X is closer to Y than to Z") subject to

(CT1) $X \,\varepsilon\, CT(Y, Z) \wedge X \,\varepsilon\, CT(Z, W) \Rightarrow X \,\varepsilon\, CT(Y, W)$;

(CT2) $X \,\varepsilon\, CT(Y, Z) \wedge Z \,\varepsilon\, CT(X, Y) \Rightarrow Y \,\varepsilon\, CT(X, Z)$;

(CT3) $non(X \,\varepsilon\, CT(Y, Y))$;

(CT4) $X \,\varepsilon\, CT(Y, Z) \Rightarrow X \,\varepsilon\, CT(Y, W) \vee X \,\varepsilon\, CT(W, Z)$.

We define this notion in the context of rough mereology: for X, Y, we let
$\mu^+(X, Y) = r \Leftrightarrow X \,\varepsilon\, \mu_r^+(Y)$ and then

$X \,\varepsilon\, CT(Y, Z) \Leftrightarrow \max(\mu^+(X, Y), \mu^+(Y, X)) > \max(\mu^+(X, Z), \mu^+(Z, X))$.

Then

Proposition 8.  The functor CT thus defined satisfies (CT1)-(CT4).
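The claim can be checked numerically on a small example. The following Python sketch assumes a hypothetical finite universe and a randomly generated table `mu[x][y]` standing in for $\mu^+(X, Y)$ (both are illustrative assumptions, not data from the text); it defines CT by the comparison of symmetrised degrees given above and tests (CT1)-(CT4) by brute force.

```python
import itertools
import random


def make_ct(mu):
    """Return ct(x, y, z), true when x is closer to y than to z."""
    def sim(x, y):
        # Symmetrised degree max(mu+(x, y), mu+(y, x)).
        return max(mu[x][y], mu[y][x])

    def ct(x, y, z):
        return sim(x, y) > sim(x, z)

    return ct


def check_axioms(ct, objs):
    """Brute-force test of (CT1)-(CT4) over all quadruples of objects."""
    for x, y, z, w in itertools.product(objs, repeat=4):
        if ct(x, y, z) and ct(x, z, w) and not ct(x, y, w):
            return False  # (CT1) violated
        if ct(x, y, z) and ct(z, x, y) and not ct(y, x, z):
            return False  # (CT2) violated
        if ct(x, y, y):
            return False  # (CT3) violated
        if ct(x, y, z) and not (ct(x, y, w) or ct(x, w, z)):
            return False  # (CT4) violated
    return True


random.seed(0)
objs = range(6)  # hypothetical universe of six objects
# Hypothetical degrees: mu[i][j] plays the role of mu+(i, j).
mu = [[1.0 if i == j else random.random() for j in objs] for i in objs]
ct = make_ct(mu)
axioms_hold = check_axioms(ct, objs)
print(axioms_hold)
```

The check succeeds for any table of degrees, because each axiom reduces to a property of the strict order on the symmetrised values.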


We may now follow e.g. the lines of [4], [3] and give definitions of other geometric
notions; for instance, letting

$T(X, Y, Z) \Leftrightarrow \forall W.\ X = W \vee CT(Y, X, W) \vee CT(Z, X, W)$

we may render the notion that X is positioned between Y and Z, and this may
permit us to define a straight line segment and further notions as pointed to in e.g.
[4]. The details will be presented elsewhere (cf. [16]).

8  Conclusion 

We have presented a scheme for developing conceptual spatial reasoning under
uncertainty in the framework of rough mereology. In this framework, as will
be presented elsewhere, we may develop various approaches to spatial reasoning,
including metric geometry based on predicates $\mu$ and metrics derived from them.
Abstract.  Rough sets theory depending upon DIS (Deterministic Information
System) is now becoming a mathematical foundation of soft
computing. Here, we pick up NIS (Non-deterministic Information System),
which is a more general system than DIS, and we try to develop
the rough sets theory depending upon NIS. We first give a definition
of definability for every object set X, then we propose an algorithm for
checking it. Finding an adequate equivalence relation from NIS for X
is the most important part of this algorithm, which is like a resolution.
According to this algorithm, we implemented some programs in Prolog
on a workstation.


1  Introduction 

Rough sets theory is seen as a mathematical foundation of soft computing, which
covers some areas of research in AI, i.e., knowledge, imprecision, vagueness,
learning, induction [1,2,3,4]. We have recently seen many applications of this theory to
knowledge discovery and data mining [5,6,7,8,9].

In this paper, we deal with rough sets in NIS (Non-deterministic Information
System), which is an advancement from rough sets in DIS (Deterministic
Information System). According to [1,2], we define every $DIS = (OB, AT, \{VAL_a \mid
a \in AT\}, f)$, where OB is a set whose elements we call objects, AT is a set whose
elements we call attributes, $VAL_a$ for $a \in AT$ is a set whose elements we call attribute
values, and $f$ is a mapping $f : OB \times AT \to \bigcup_{a \in AT} VAL_a$, which we call the
classification function. For every $x, y\ (x \neq y) \in OB$, if $f(x, a) = f(y, a)$ for
every $a \in AT$ then we see there is a relation between $x$ and $y$. This relation becomes
an equivalence relation on OB, namely we can always define an equivalence relation
EQ on OB. If a set $X\ (\subseteq OB)$ is the union of some equivalence classes in
EQ, then we say that X is definable in DIS. Otherwise we say that X is rough [1].

Now we go to the NIS. We define every $NIS = (OB, AT, \{VAL_a \mid a \in
AT\}, g)$, where $g$ is a mapping $g : OB \times AT \to P(\bigcup_{a \in AT} VAL_a)$
(the power set of $\bigcup_{a \in AT} VAL_a$) [3,4]. We need to remark that there are two
interpretations for the mapping $g$, namely the AND-interpretation and the OR-interpretation. For
example, we can give the following two interpretations for g(tom, language) =
{English, Polish, Japanese}.

(AND-interpretation) Tom can use three languages, English, Polish and Japanese.
Namely, we see g(tom, language) as English ∧ Polish ∧ Japanese.
(OR-interpretation) Tom can use exactly one of the languages English, Polish or
Japanese. Namely, we see g(tom, language) as English ∨ Polish ∨ Japanese.

The OR-interpretation seems to be more important for $g$, because it is related
to incomplete information and uncertain information. Furthermore, knowledge
discovery, data mining and machine learning from incomplete information and
uncertain information will be important issues. In such a situation, we discuss NIS
with the OR-interpretation. We have already proposed incomplete information and
selective information for the OR-interpretation [10], where we distinguished them by
the existence of an unknown real value. In this paper, we extend the contents of
[10] and develop an algorithm for finding equivalence relations in NIS.

2  Aim  and  Purpose  in  Handling  NIS 

In this section, we show the aim and purpose of handling NIS. Let us consider
the following example.

Example 1. Suppose the next $NIS_1$ such that $OB = \{1, 2, 3, 4\}$, $AT = \{A, B, C\}$,
$\bigcup_{a \in AT} VAL_a = \{1, 2, 3\}$ and $g$ is given by the following table.

OB   A       B       C
1    1 ∨ 2   2       1 ∨ 2 ∨ 3
2    1       2       1 ∨ 2 ∨ 3
3    1       1 ∨ 2   2
4    1       2       2 ∨ 3

Table 1. Non-deterministic Table for $NIS_1$

In this table, if we select an element for every disjunction then we get a DIS. There are
72 (= 2*3*3*2*2) DISs for this $NIS_1$. In this case, we have the following issues.

Issue 1: For a set $\{1, 2\}\ (\subseteq OB)$, if we select 1 from $g(1, A)$ and 3 from $g(1, C)$,
$g(2, C)$ and $g(4, C)$ then {1, 2} is not definable. However, if we select 1 from
$g(1, C)$ and $g(2, C)$ then {1, 2} is definable. How can we check such definability
for every subset X of OB?

Issue 2: How can we get all possible equivalence relations from the 72 DISs? Do
we have to check the 72 DISs sequentially?

Issue 3: Suppose there is the following information for attribute D: $g(1, D) = \{1\}$,
$g(2, D) = \{1\}$, $g(3, D) = \{2\}$ and $g(4, D) = \{2\}$, respectively. In this case,
which DIS from $NIS_1$ makes $(A, B, C) \to D$ consistent? How can we get
all DISs which make $(A, B, C) \to D$ consistent?

These issues come from the fact that the equivalence relation in a DIS is
always unique, whereas there may be several possible equivalence relations for a NIS.

We now briefly show an actual execution for Issue 2 to clarify how our
system works.


?-relationall.
[1] [[1,2,3,4]] 1
[2] [[1,2,3],[4]] 1
[3] [[1,2,4],[3]] 3
[4] [[1,2],[3,4]] 2
[5] [[1,2],[3],[4]] 5
[6] [[1,3,4],[2]] 2
[7] [[1,3],[2,4]] 1
[8] [[1,3],[2],[4]] 1
[9] [[1,4],[2,3]] 1
[10] [[1,4],[2],[3]] 5
[11] [[1],[2,3,4]] 5
[12] [[1],[2,3],[4]] 4
[13] [[1],[2,4],[3]] 14
[14] [[1],[2],[3,4]] 8
[15] [[1],[2],[3],[4]] 19
POSSIBLE CASES 72
EXEC_TIME=0.1666100121(sec)
yes


In the above execution, we see there are 15 kinds of equivalence relations and
there are 19 DISs whose equivalence relation is {{1}, {2}, {3}, {4}}. According
to this execution, we can see that 2 cases of {{1,2}, {3,4}}, 5 cases of
{{1,2}, {3}, {4}}, 8 cases of {{1}, {2}, {3,4}} and 19 cases of {{1}, {2}, {3},
{4}} make $(A, B, C) \to D$ consistent by Proposition 4.1 in [1]. In the subsequent
sections, we discuss the definability of every set in NIS as well as the
above issues.
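The figures in this execution can be cross-checked by brute force. The Python sketch below is an illustration only, not the authors' Prolog system: it enumerates every DIS obtainable from Table 1, groups them by equivalence relation, and counts those for which every equivalence class carries a single value of D (which is how consistency of $(A, B, C) \to D$ is read here, an assumption stated for this sketch). The names `NIS1`, `extensions` and `partition` are hypothetical.

```python
from collections import Counter
from itertools import product

# Table 1: each cell is the set of possible values (OR-interpretation).
NIS1 = {
    1: [{1, 2}, {2}, {1, 2, 3}],
    2: [{1}, {2}, {1, 2, 3}],
    3: [{1}, {1, 2}, {2}],
    4: [{1}, {2}, {2, 3}],
}
D = {1: 1, 2: 1, 3: 2, 4: 2}  # values of attribute D from Issue 3


def extensions(nis):
    """Yield every DIS derivable from the NIS as a dict object -> tuple."""
    objs = sorted(nis)
    per_object = [sorted(product(*(sorted(c) for c in nis[o]))) for o in objs]
    for choice in product(*per_object):
        yield dict(zip(objs, choice))


def partition(dis):
    """Equivalence relation of a DIS as a sorted tuple of sorted classes."""
    classes = {}
    for obj, tup in dis.items():
        classes.setdefault(tup, []).append(obj)
    return tuple(sorted(tuple(sorted(c)) for c in classes.values()))


counts = Counter(partition(d) for d in extensions(NIS1))

# A DIS is counted as consistent when no class mixes two values of D.
consistent = sum(
    n for p, n in counts.items()
    if all(len({D[o] for o in c}) == 1 for c in p)
)

for p, n in sorted(counts.items()):
    print([list(c) for c in p], n)
print("POSSIBLE CASES", sum(counts.values()))
print("CONSISTENT CASES", consistent)
```

The sketch yields 15 distinct equivalence relations over 72 DISs, with 19 DISs for the discrete relation and 5 for {{1,2}, {3}, {4}}; the consistent cases total 2 + 5 + 8 + 19 = 34. Sequential enumeration of this kind is exactly what the algorithm of the following sections is meant to avoid.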


3  An  Algorithm  for  Checking  Definability  of  Set  in  DIS 

In this section, we briefly recall an algorithm to detect the definability of a set
in DIS. Here, we suppose an equivalence relation EQ in the DIS, and we use
$[x]$ to denote the equivalence class containing object $x$.

An Algorithm in DIS

(1) Make a set $SUP\ (= \bigcup_{x \in X} [x])$.

(2) If $SUP = X$ then X is definable in DIS, else go to the next step (3).

(3) Make a set $INF\ (= \bigcup\{[x] \in EQ \mid [x] \subseteq X\})$; then the lower and the upper
approximations of X are INF and SUP, respectively.

The above algorithm manages the definability of a set X and the upper and lower
approximations of X. We will propose a new algorithm in NIS depending upon
the above one.
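The three steps can be sketched in Python as follows. This is a minimal illustration under stated assumptions: a DIS is represented as a hypothetical dict from objects to attribute-value tuples, and the sample `DIS` is one particular extension of $NIS_1$ chosen for the example.

```python
def equivalence_classes(dis):
    """Group objects that have identical attribute-value tuples."""
    classes = {}
    for obj, tup in dis.items():
        classes.setdefault(tuple(tup), set()).add(obj)
    return list(classes.values())


def approximations(dis, X):
    """Return (INF, SUP, definable) for a set X of objects."""
    X = set(X)
    eq = equivalence_classes(dis)
    # Step (1): SUP is the union of the classes that meet X.
    sup = set().union(*(c for c in eq if c & X))
    # Step (3): INF is the union of the classes contained in X.
    inf = set().union(*(c for c in eq if c <= X))
    # Step (2): X is definable exactly when SUP equals X.
    return inf, sup, sup == X


# Hypothetical DIS: one extension of NIS_1 with classes {1,2}, {3}, {4}.
DIS = {1: (1, 2, 3), 2: (1, 2, 3), 3: (1, 1, 2), 4: (1, 2, 2)}

print(approximations(DIS, {1, 2}))
print(approximations(DIS, {2, 3}))
```

In this DIS the set {1, 2} is definable, while {2, 3} is rough with lower approximation {3} and upper approximation {1, 2, 3}.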


4  Some  Definitions  and  Properties  in  NIS 

We  first  give  some  definitions  then  we  show  a  proposition. 

Definition 1. For $NIS = (OB, AT, \{VAL_a \mid a \in AT\}, g)$, we call $NIS' = (OB,
AT, \{VAL_a \mid a \in AT\}, g')$ which satisfies the following (1) and (2) an extension
from NIS.

(1) $g'(x, a) \subseteq g(x, a)$ for every $x \in OB$, $a \in AT$.

(2) $g'(x, a)$ is a singleton set for every $x \in OB$, $a \in AT$.

Here, we can see that every extension from NIS is a DIS, because every attribute
value is fixed uniquely.
Definition 2. For every extension $NIS'$ from NIS, we call the equivalence
relation in $NIS'$ a possible equivalence relation in NIS. We also call every
element in this relation a possible equivalence class in NIS.

In every DIS, we know the definability of a set X, so we give the next definition.

Definition 3. A set $X\ (\subseteq OB)$ is definable in NIS, if X is definable in some
extension from NIS.

A straightforward way to detect the definability of a set X in NIS comes to mind at
once, namely we sequentially make every extension from NIS and execute the program
for the algorithm in DIS. However, we need the same number of files as extensions
from NIS. Furthermore, if X is not definable in NIS then we have to execute
the same program for all extensions. So we propose another way from now on.
We give the following definitions.

Definition 4. Suppose $NIS = (OB, AT, \{VAL_a \mid a \in AT\}, g)$. If $g(x, a)$ is a
singleton set for every $a \in AT$ then we say that object $x$ is fixed. Furthermore,
$OB_{fixed} = \{x \in OB \mid$ object $x$ is fixed$\}$.

Definition 5. Suppose $NIS = (OB, AT, \{VAL_a \mid a \in AT\}, g)$, and $g(x, a)$ is not
a singleton set for some $a \in AT$. By picking up an element in each such $g(x, a)$, we
can make object $x$ fixed. Here, we call a set of pairs {[attribute, picked-element]} a
selection in $x$. For a selection $\theta$, $x_\theta$ expresses the fixed tuple for $x$.

In Example 1, if we take a selection $\theta = \{[A, 1], [C, 1]\}$, then $1_\theta$ is (1, 2, 1).
For $\theta = \{[B, 2]\}$, $3_\theta$ is (1, 2, 2).

Definition 6. Suppose $NIS = (OB, AT, \{VAL_a \mid a \in AT\}, g)$. For every $x\ (\in
OB)$ and selection $\theta$ in $x$, we give the following definitions.

(1) $inf(x, \theta) = \{x\} \cup \{y \in OB_{fixed} \mid x_\theta$ and the tuple for $y$ are the same$\}$.

(2) $sup(x, \theta) = \{y \in OB \mid$ there is a selection $\theta'$ such that $x_\theta = y_{\theta'}\}$.

According to these definitions, we get the following proposition.

Proposition 1.

(1) The $inf(x, \theta)$ is the minimal possible equivalence class including object $x$ for
the selection $\theta$.

(2) For every $y \in (sup(x, \theta) - inf(x, \theta))$, there are selections $\theta'$ and $\theta''$ such that
$x_\theta = y_{\theta'}$ and $x_\theta \neq y_{\theta''}$.

(3) A subset $X\ (\subseteq OB)$ which satisfies $inf(x, \theta) \subseteq X \subseteq sup(x, \theta)$ for some $x$ and
$\theta$ can be a possible equivalence class.

(Proof) (1) For $x$ and $\theta$, the tuple for every $y \in inf(x, \theta)$ is the same and fixed.
So $inf(x, \theta)$ is a minimal possible equivalence class with $x$ for the selection $\theta$.

(2) For $y \in (sup(x, \theta) - inf(x, \theta))$, we get $y \in sup(x, \theta)$ and $y \notin inf(x, \theta)$. By
the definition of sup, there is a selection $\theta'$ such that $x_\theta = y_{\theta'}$. If $y \in OB_{fixed}$
then $y \in inf(x, \theta)$, which contradicts $y \notin inf(x, \theta)$. So $y \notin OB_{fixed}$,
and there exists at least one other selection $\theta''$ such that $y_{\theta''} \neq x_\theta$.

(3) According to (1) and (2), $inf(x, \theta) \cup M$ for $M \subseteq (sup(x, \theta) - inf(x, \theta))$ can
be a possible equivalence class.

In this proposition, (3) is related to the definability of a set in NIS and we
use this property. However, we have to remark that $inf(x, \theta)$ and $sup(x, \theta)$ are
not independent for every $x$. The $inf(x, \theta)$ and $sup(x, \theta)$ are mutually related to
other $inf(y, \theta')$ and $sup(y, \theta')$. We show it in the next example.
Example 2. Consider $NIS_1$ in Example 1. Here $OB_{fixed} = \emptyset$, and we get the
following subset of all inf and sup.

(A) $inf(1, \{[A,1],[C,3]\}) = \{1\}$, $sup(1, \{[A,1],[C,3]\}) = \{1,2,4\}$.

(B) $inf(3, \{[B,1]\}) = \{3\}$, $sup(3, \{[B,1]\}) = \{3\}$.

(C) $inf(3, \{[B,2]\}) = \{3\}$, $sup(3, \{[B,2]\}) = \{1,2,3,4\}$.

(D) $inf(4, \{[C,2]\}) = \{4\}$, $sup(4, \{[C,2]\}) = \{1,2,3,4\}$.

(E) $inf(4, \{[C,3]\}) = \{4\}$, $sup(4, \{[C,3]\}) = \{1,2,4\}$.

Here in (A), each of the sets {1}, {1,2}, {1,4} and {1,2,4} can be a possible
equivalence class by (3) in Proposition 1. However, if we make {1,2} a possible
equivalence class, then we implicitly make object $4 \notin [1]\ (= [2])$. It implies that
selection [C,3] for object 4 is rejected, because $4_{\{[C,3]\}}$ is (1,2,3), which is the same
as $1_{\{[A,1],[C,3]\}}$. Namely, we cannot use (E) and we have to revise (C) and (D)
as follows:

(C') $inf(3, \{[B,2]\}) = \{3\}$, $sup(3, \{[B,2]\}) = \{3,4\}$.

(D') $inf(4, \{[C,2]\}) = \{4\}$, $sup(4, \{[C,2]\}) = \{3,4\}$.

If we use (B) then [3] = {3} and we reject (C'), because only one of (B) and (C') can hold.
Here, we have to revise (D') as follows:

(D'') $inf(4, \{[C,2]\}) = \{4\}$, $sup(4, \{[C,2]\}) = \{4\}$.

For this (D''), only {4} can be a possible equivalence class. Finally we get a possible
equivalence relation {{1,2}, {3}, {4}}, and the selections are {[A,1],[C,3]}
for object 1, {[C,3]} for 2, {[B,1]} for 3 and {[C,2]} for 4. These selections
specify a DIS from NIS. We also know that sets like {1,2,3} and {3,4} are
definable in NIS but {2,3} is not definable in this DIS.
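The inf and sup values (A)-(E) of Example 2 can be recomputed directly from Definition 6. In the Python sketch below, a selection is identified with the fixed tuple it produces, and the table is stored as a hypothetical dict `NIS1` of value sets; both choices are assumptions made for illustration.

```python
# Table 1 of Example 1: each cell is the set of possible values.
NIS1 = {
    1: [{1, 2}, {2}, {1, 2, 3}],
    2: [{1}, {2}, {1, 2, 3}],
    3: [{1}, {1, 2}, {2}],
    4: [{1}, {2}, {2, 3}],
}


def fixed_objects(nis):
    """OB_fixed: objects whose every attribute value is a singleton."""
    return {o for o, row in nis.items() if all(len(c) == 1 for c in row)}


def inf_sup(nis, x, tup):
    """inf and sup of Definition 6 for object x and its fixed tuple tup."""
    fixed = fixed_objects(nis)
    # inf: x itself plus the fixed objects whose tuple equals tup.
    inf = {x} | {
        y for y in fixed
        if tuple(next(iter(c)) for c in nis[y]) == tup
    }
    # sup: every object that can reach tup by some selection.
    sup = {
        y for y, row in nis.items()
        if all(v in c for v, c in zip(tup, row))
    }
    return inf, sup


print(fixed_objects(NIS1))
print(inf_sup(NIS1, 1, (1, 2, 3)))  # case (A)
print(inf_sup(NIS1, 3, (1, 1, 2)))  # case (B)
print(inf_sup(NIS1, 3, (1, 2, 2)))  # case (C)
```

Note that these are the initial values only; the revisions (C'), (D') and (D'') arise once some class has been committed, which this sketch does not model.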

5  Proposal  of  An  Algorithm  in  NIS 

The following is an overview of the proposed algorithm.

An Algorithm for Checking Definability of Set in NIS
Suppose we are given $inf(x, \theta)$ and $sup(x, \theta)$ for every $x\ (\in OB)$.

Input: A set $X\ (\subseteq OB)$.

Output: X is definable in NIS or not.

(1) Set $X^* = X$.

(2) For the first element $x\ (\in X^*)$, find $X'\ (\subseteq X^*)$ such that $inf(x, \theta) \subseteq X' \subseteq
sup(x, \theta)$ for some $\theta$.

(3) The usable $inf(y, \theta')$ and $sup(y, \theta')$ for $y \in OB$ are restricted by selecting
$X'$ in (2). So, check the usable inf and sup, and go to (4).

(4) If there is no contradiction in (3), then set $[x] = X'$, $X^* = X^* - X'$ and
go to (2). In particular, if $X^* = \emptyset$ then we conclude X is definable. To find
other cases, backtrack to (2). If there is a contradiction in (3), then backtrack
to (2) and try another $X'$. If there is no branch left for backtracking, then we
conclude X is not definable.

In this algorithm, if we set X = OB then we can get all possible equivalence
relations. This algorithm seems to be simple and natural, but managing the inf
and sup is very complicated. We also need to discuss how we get the inf and sup
information from NIS.
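The backtracking structure of steps (1)-(4) can be sketched in Python. This is a simplified illustration, not the authors' prover: instead of revising inf and sup, it keeps for each object the set of tuples the object may still take (the name `allowed` and the toy table `TOY` are hypothetical), and a contradiction is an object left with no admissible tuple. A brute-force check over all extensions, following Definition 3, is included for comparison.

```python
from itertools import combinations, product


def is_definable(nis, X):
    """Backtracking check of whether X is definable in some extension."""
    allowed = {y: set(product(*row)) for y, row in nis.items()}
    return _search(sorted(X), allowed)


def _search(rest, allowed):
    if not rest:
        return True  # X* is empty: X is a union of possible classes
    x = rest[0]
    for t in sorted(allowed[x]):  # choose a selection (a tuple) for x
        may = [y for y in rest[1:] if t in allowed[y]]
        for k in range(len(may) + 1):
            for extra in combinations(may, k):
                cls = {x, *extra}  # candidate class X' containing x
                # Every object outside X' must avoid the tuple t.
                remaining = {
                    y: ts - {t} for y, ts in allowed.items() if y not in cls
                }
                if all(remaining.values()) and _search(
                    [y for y in rest if y not in cls], remaining
                ):
                    return True
    return False


def is_definable_bruteforce(nis, X):
    """Definition 3 applied literally: try every extension."""
    objs = sorted(nis)
    for choice in product(*(sorted(product(*nis[o])) for o in objs)):
        tup = dict(zip(objs, choice))
        if all(tup[x] != tup[y] for x in X for y in objs if y not in X):
            return True
    return False


# Hypothetical NIS: objects 1 and 2 are fixed and share the same tuple.
TOY = {1: [{1}, {1}], 2: [{1}, {1}], 3: [{1, 2}, {1}]}

print(is_definable(TOY, {1}))     # 1 cannot be separated from 2
print(is_definable(TOY, {1, 2}))  # object 3 may pick value 2
```

In the toy table, {1} and {1, 3} are not definable because objects 1 and 2 always coincide, while {1, 2} and {3} are definable.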
6  Implementation of the Proposed Algorithm in NIS

In this section, we show the implementation of a prover for NIS. We rely
on Prolog on a workstation for implementing this prover. Our prover
consists of the following two subsystems:

(1) File translator from data file to an internal expression.

(2) Query interpreter with some subcommands.

6.1  Data  File  for  NIS 

Here, we show the data file in Prolog, which is very simple. We use two kinds of
atomic formulas:

object(number-of-object, number-of-attributes).
data(object, tuple-data).

The following is the real data file for $NIS_1$.

object(4,3).  data(1,[[1,2],2,[1,2,3]]).  data(2,[1,2,[1,2,3]]).
data(3,[1,[1,2],2]).  data(4,[1,2,[2,3]]).

We use a list to express a disjunction. This data structure is so simple that we can
readily make this file from every non-deterministic table. There are no restrictions
on the number of items except those imposed by Prolog and the workstation.

6.2  File  Translator  from  Data  File  to  Internal  Expression 

This translator creates an internal expression from every data file, which consists
of the following three kinds of atomic formulas.

cond(object, number-for-selection, tuple-for-this-selection).
pos(object, number-of-all-selections).
conn([object, number-for-selection], [slist, slist1], [mlist, mlist1], maylist).

As for the 2nd, 3rd and 4th arguments in conn, we will show their contents by
using a real execution. The following is the translation of the data file.

?-consult(nkbtf.pl).
yes
?-go.
File Name for Read Open:'nkbda23.pl'.
File Name for Write Open:'out.pl'.
EXEC_TIME=0.05459904671(sec)
yes

In this translation, nkbtf.pl is the translator and nkbda23.pl is a data file for
$NIS_1$. The file out.pl keeps the internal expression for $NIS_1$. The following is a
part of the internal expression for object 3.

cond(3,1,[1,1,2]).
cond(3,2,[1,2,2]).
pos(3,2).
conn([3,1],[[3],[1]],[[],[]],[[3,1]]).
conn([3,2],[[3],[2]],[[1,2,4],[3,2,1]],[[3,2],[1,3],[2,2],[4,1]]).

The pos(3,2) shows there are two selections for object 3, and cond(3,1,[1,1,2])
shows that [1,1,2] is the tuple for the first selection $\theta\ (= \{[B,1]\})$. In this selection,
the 2nd argument in conn([3,1],_,_,_) shows $inf(3, \theta) = \{3\}$ and the 3rd argument
shows $sup(3, \theta) - inf(3, \theta) = \emptyset$. Similarly for the second selection $\theta'\ (= \{[B,2]\})$,
which makes the tuple [1,2,2], we get $inf(3, \theta') = \{3\}$ and $sup(3, \theta') - inf(3, \theta') =
\{1,2,4\}$. Here, we identify the selection $\theta$ with the second argument in cond.
For example, we identify the selection $\theta = \{[B,1]\}$ with the second argument 1 in
cond(3,1,[1,1,2]).

Definition 7. For cond(object, number_for_selection, tuple), we call
number_for_selection an index of selection $\theta$ and we call [object, number_for_selection]
an index of the fixed tuple.

6.3  An  Algorithm  for  Translator 

Now we briefly show the translation algorithm, which consists of two phases.
In Phase 1, we create cond(object,_,_) and pos(object,_) from data(object,_).
For every data(object, list), we first make the cartesian products from list, then
we sequentially assert cond(object, selection, fixed_tuple), and finally we assert
pos(object, last_number).

In Phase 2, we make every conn([object, selection],_,_,_) from every cond.
For every cond(object, selection, fixed_tuple), we first initialize the lists [slist, slist1]
and [mlist, mlist1] and we find other cond(object', selection', fixed_tuple). If
pos(object', 1) then we add [object', selection'] to [slist, slist1], else we add it to
[mlist, mlist1]. We continue this for all selections. Finally, we assign the union of
[slist, slist1] and [mlist, mlist1] to maylist and assert conn([object, selection],
[slist, slist1], [mlist, mlist1], maylist). We have realized the translator according
to this algorithm.
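The two phases can be sketched in Python. This is an illustration under assumptions, not the translator nkbtf.pl itself: facts are held in dicts rather than asserted, the object itself is placed first in slist, and selections are numbered with the first attribute varying fastest. The last two choices are inferred from the sample cond and conn facts for object 3 and are not stated in the text.

```python
from itertools import product

# Data file for NIS_1: a list inside a row expresses a disjunction.
DATA = {
    1: [[1, 2], 2, [1, 2, 3]],
    2: [1, 2, [1, 2, 3]],
    3: [1, [1, 2], 2],
    4: [1, 2, [2, 3]],
}


def phase1(data):
    """Build cond[(object, selection)] = tuple and pos[object] = count."""
    cond, pos = {}, {}
    for obj, row in data.items():
        choices = [v if isinstance(v, list) else [v] for v in row]
        # Assumed numbering: the first attribute varies fastest.
        tuples = [t[::-1] for t in product(*reversed(choices))]
        for index, tup in enumerate(tuples, start=1):
            cond[(obj, index)] = list(tup)
        pos[obj] = len(tuples)
    return cond, pos


def phase2(cond, pos):
    """Build conn[(object, selection)] from the cond and pos facts."""
    conn = {}
    for (obj, sel), tup in cond.items():
        slist, slist1, mlist, mlist1 = [obj], [sel], [], []
        for (other, osel), otup in cond.items():
            if other == obj or otup != tup:
                continue
            if pos[other] == 1:  # other is fixed: it must share the class
                slist.append(other)
                slist1.append(osel)
            else:  # other may or may not share the class
                mlist.append(other)
                mlist1.append(osel)
        maylist = [list(p) for p in zip(slist + mlist, slist1 + mlist1)]
        conn[(obj, sel)] = (slist, slist1, mlist, mlist1, maylist)
    return conn


cond, pos = phase1(DATA)
conn = phase2(cond, pos)
print(cond[(3, 1)], cond[(3, 2)], pos[3])
print(conn[(3, 2)])
```

Under these assumptions the sketch produces the same contents as the facts listed for object 3, including the index 3 for the tuple [1,2,2] of object 1.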

6.4  An  Algorithm  for  Handling  Usable  inf  and  sup 

In the proposed algorithm, the most difficult part is to handle every subset of objects
from the usable inf and sup. The usable inf and sup are dynamically revised, so we
need to keep track of which inf and sup are usable. For example, in the translated
conn([3,2],[[3],[2]],[[1,2,4],[3,2,1]],_), every $\{3\} \cup M\ (M \subseteq \{1,2,4\})$ can be a
possible equivalence class by Proposition 1. To make {1,3} a possible equivalence
class, we need to positively use object 1 in {1,2,4} and negatively use objects 2
and 4 in {1,2,4}.

Definition 8. For $X \subseteq OB$, suppose $inf(x, \theta) \subseteq X \subseteq sup(x, \theta)$ for some
$x \in OB$. In this case, we call every element in X a positive use of index $[x, \theta]$ and
every element in $(sup(x, \theta) - X)$ a negative use of $[x, \theta]$.

To manage these two kinds of usage, we adopt a positive list PLIST and a
negative list NLIST. The PLIST keeps indexes [object, selection] which have
been applied as positive use, and the NLIST keeps indexes which have been
applied as negative use. For these two lists and positive and negative use, we
have the following remarks.

Remark for Positive Use of $[x, \theta]$

Suppose the index for $x_\theta$ is [x, num]. The $x_\theta$ is applicable as positive use only
when $[x, \_] \notin PLIST$ and $[x, num] \notin NLIST$.
Remark for Negative Use of $[x, \theta]$

Suppose the index for $x_\theta$ is [x, num]. The $x_\theta$ is applicable as negative use in the
following cases:

(1) $[x, num] \in NLIST$.

(2) $[x, num] \notin NLIST$, $[x, num] \notin PLIST$ and $[x, num'] \in PLIST$ for $num \neq
num'$.

(3) $[x, num] \notin NLIST$, $[x, \_] \notin PLIST$ and there is at least one $[x, num''] \notin NLIST$
for $num \neq num''$.

The above remarks avoid the contradiction that $[x, \theta]$ is applied not only as
positive use but also as negative use. The third condition for negative use ensures
that $[x, num] \in NLIST$ for all num does not hold.
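The two remarks translate into simple membership tests. The Python sketch below is an illustration under assumptions: PLIST and NLIST are modelled as sets of (object, selection index) pairs, and `pos` maps each object to its number of selections; the function names are hypothetical.

```python
def positive_ok(x, num, plist, nlist):
    """Positive use: no index of x is in PLIST and [x, num] is not in NLIST."""
    return all(obj != x for obj, _ in plist) and (x, num) not in nlist


def negative_ok(x, num, plist, nlist, pos):
    """Negative use of [x, num] according to cases (1)-(3)."""
    if (x, num) in nlist:
        return True  # case (1): already rejected
    chosen = [n for obj, n in plist if obj == x]
    if chosen:
        # Case (2): another selection of x is already in positive use.
        return num not in chosen
    # Case (3): at least one other selection of x must remain available.
    others = (n for n in range(1, pos[x] + 1) if n != num)
    return any((x, n) not in nlist for n in others)


pos = {3: 2, 4: 2}  # objects 3 and 4 of NIS_1 have two selections each

print(positive_ok(3, 1, set(), set()))
print(positive_ok(3, 2, {(3, 1)}, set()))
print(negative_ok(3, 2, {(3, 1)}, set(), pos))
print(negative_ok(4, 1, set(), {(4, 2)}, pos))
```

The last call illustrates the purpose of case (3): rejecting [4, 1] when [4, 2] is already rejected would leave object 4 without any selection, so the negative use is refused.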

Now we show the algorithm for finding a possible equivalence class.

An Algorithm: candidate

Input: $X = \{x_1, \dots, x_n\} \subseteq OB$, inf, sup, current PLIST and NLIST.

Output: There is a possible equivalence class $[x_1] \subseteq X$ such that $inf(x_1, \theta) \subseteq
[x_1] \subseteq sup(x_1, \theta)$ or not.

(1) Pick up a selection $\theta$ such that $inf(x_1, \theta) \subseteq X$. If we cannot pick up such a
selection then respond that there is no possible equivalence class.

(2) If every element in $inf(x_1, \theta)$ is applicable as positive use then go to (3), else
go to (1) and try another selection.

(3) Pick up a set $M\ (\subseteq (sup(x_1, \theta) - inf(x_1, \theta)))$ and go to (4). If we cannot
pick up any other M then go to (1) and try another selection.

(4) If $M \subseteq (X - inf(x_1, \theta))$ and every element in M is applicable as positive
use then set $PLIST \leftarrow PLIST \cup \{[y, \theta'] \mid y \in inf(x_1, \theta) \cup M,\ y_{\theta'} = x_{1,\theta}\}$ and
go to (5), else go to (3) and try another M.

(5) If every element in $(sup(x_1, \theta) - (inf(x_1, \theta) \cup M))$ is applicable as negative
use then go to (6), else go to (3) and try another M.

(6) Set $NLIST \leftarrow NLIST \cup \{[y, \theta'] \mid y \in (sup(x_1, \theta) - (inf(x_1, \theta) \cup M)),\ y_{\theta'} =
x_{1,\theta}\}$. Respond that $[x_1]\ (= inf(x_1, \theta) \cup M)$ can be a possible equivalence class.

According to this algorithm, we realized a program candidate which returns
a possible equivalence class depending upon the current PLIST and NLIST.

6.5  Realization  of  Query  Interpreter  and  Its  Subcommands 

Now we show the basic program class, which depends upon the algorithm candidate.
This class manages the definability of a set in NIS.

% Base case: nothing is left to classify, so return the accumulated results.
class(X,Y,EQUIV,Ppre,Pres,Npre,Nres)
    :-X==[],EQUIV=Y,Pres=Ppre,Nres=Npre.
% Recursive case: find a class CAN, remove it, and classify the rest.
class([X|X1],Y,EQUIV,Ppre,Pres,Npre,Nres)
    :-candidate([X|X1],CAN,Ppre,Pres1,Npre,Nres1),
    minus([X|X1],CAN,REST),
    class(REST,[CAN|Y],EQUIV,Pres1,Pres,Nres1,Nres).

In class, the second argument Y keeps the temporary set of equivalence classes,
the fourth argument Ppre keeps the temporary PLIST and the sixth argument
Npre keeps the temporary NLIST. In the second clause, we first make a set
$CAN\ (\subseteq [X|X1])$ which satisfies all conditions, then we execute class for the set
([X|X1] - CAN) again. If this ([X|X1] - CAN) is the empty set, then the first clause
is called and the temporary items are unified with the response variables EQUIV, Pres
and Nres. After finding a refutation for class, we get an equivalence relation
and a DIS. We have also prepared some subcommands depending upon class:
classex, relation, relationex and relationall.

Now, we just show the real execution times for some NISs.

(CASE1) In $NIS_1$, we got two DISs for relationex([[1,2],[3,4]]) in 0.0697(sec).

(CASE2) The number of objects is 20, of attributes is 10, and of DISs from NIS is 648 (=
$2^3 * 3^4$). It took 0.1646(sec) for translation. For class([1,2,3,4,5]), we got no
DIS in 0.0018(sec). For class([1,2,3,4,5,6]), we got 324 DISs in 0.0481(sec).
For relationall, which is the heaviest query, we got 48 possible equivalence
relations in 2.0513(sec).

(CASE3) The number of objects is 70, of attributes is 4, and of DISs from NIS is 34992 (=
$2^4 * 3^7$). It took 0.3875(sec) for translation. For class([1,2,3,4,5]), we got no
DIS in 0.0053(sec). For relationall, we got 4 possible equivalence relations in
215.6433(sec). The relations come from 20736 DISs, 2592 DISs, 10368 DISs
and 1296 DISs, respectively.

7  Concluding  Remarks 

In this paper, we discussed the definability of a set in NIS and proposed an
algorithm for checking it. The algorithm candidate plays an important role in
realizing some programs, which will be a good tool for handling NIS. We will
apply our framework to machine learning and knowledge discovery from NIS.
Abstract.  The rough set theory, based on the conventional indiscernibility
relation, is not useful for analysing incomplete information. We introduce
two generalizations of this theory. The first proposal is based on
non symmetric similarity relations, while the second one uses a valued tolerance
relation. Both approaches provide more informative results than
the previously known approach employing a simple tolerance relation.


1  Introduction 

Rough set theory has been developed since Pawlak's seminal work [5] (see also
[6]) as a tool for classifying objects which are only "roughly" described,
in the sense that the available information enables only a partial discrimination
among them although they are considered as different objects. In other terms,
objects considered as "distinct" could happen to have the "same" or "similar"
description, at least as far as a set of attributes is considered. Such a set of
attributes can be viewed as the possible dimensions under which the surrounding
world can be described for a given knowledge. An explicit hypothesis made in
the classic rough set theory is that all available objects are completely described
by the set of available attributes. Denoting the set of objects as $A = \{a_1, \dots, a_n\}$
and the set of attributes as $C = \{c_1, \dots, c_m\}$, it is considered that $\forall a_j \in A, c_i \in C$,
the attribute value always exists, i.e. $c_i(a_j) \neq \emptyset$.

Such a hypothesis, although sound, contrasts with several empirical situations
where the information concerning the set A is only partial, either because it has
not been possible to obtain the attribute values (for instance, if the elements of A are
patients and the attributes are clinical exams, not all results may be available at
a given time) or because it is definitely impossible to get a value for some object
on a given attribute.

The problem has already been faced in the literature by Grzymala [2], Kryszkiewicz
[3, 4], and Słowiński and Stefanowski [7]. Our paper enhances such works by
distinguishing two different semantics for the incomplete information: the "missing"
semantics (unknown values allow any comparison) and the "absent" semantics
(unknown values do not allow any comparison), and it explores three different
formalisms to handle incomplete information tables: tolerance relations, non symmetric
similarity relations and valued tolerance relations.

The paper is organized as follows. In section 2 we discuss the tolerance approach
introduced by Kryszkiewicz [3]. Moreover, we give an example of an incomplete
information table which will be used throughout the paper in order to help
the understanding of the different approaches and allow comparisons. In section
3 an approach based on non symmetric similarity relations is introduced using
some results obtained by Słowiński and Vanderpooten [8]. We also demonstrate
that the non symmetric similarity approach refines the results obtained using the
tolerance relation approach. Finally, in section 4 a valued tolerance approach is
introduced and discussed as an intermediate approach between the two previous
ones. Conclusions are given in the last section.


2  Tolerance  relations 

In the following we briefly present the idea introduced by Kryszkiewicz [3]. In
our view the key concept of this approach is to associate with the unavailable
values of the information table a “null” value, to be read as an “everything is
possible” value. Such an interpretation corresponds to the idea that such values
are just “missing”, but they do exist. In other words, it is our imperfect
knowledge that obliges us to work with a partial information table. Each object
potentially has a complete description, but we just miss it for the moment. More
formally, given an information table $IT = (A, C)$ and a subset of attributes
$B \subseteq C$, we denote the missing values by $*$ and we introduce the
following binary relation $T$:

$$\forall (x, y) \in A \times A:\quad T(x,y) \Leftrightarrow \forall c_j \in B:\ c_j(x) = c_j(y)\ \text{or}\ c_j(x) = *\ \text{or}\ c_j(y) = *$$

Clearly $T$ is a reflexive and symmetric relation, but not necessarily
transitive. We call the relation $T$ a “tolerance relation”. Further on, let us
denote by $I_B(x)$ the set of objects $y$ for which $T(x, y)$ holds taking into
account the attributes $B$. We call such a set the “tolerance class of $x$”,
thus allowing the definition of a set of tolerance classes of the set $A$. We
can now use the tolerance classes as the basis for redefining the concepts of
lower and upper approximation of a set $\Phi$ using the set of attributes
$B \subseteq C$. We have:

$\Phi_B = \{x \in A \mid I_B(x) \subseteq \Phi\}$  the lower approximation of $\Phi$

$\Phi^B = \{x \in A \mid I_B(x) \cap \Phi \neq \emptyset\}$  the upper approximation of $\Phi$

It is easy to observe that $\Phi^B = \bigcup\{I_B(x) \mid x \in \Phi\}$ also
holds. We now introduce an example of an incomplete information table which
will be used further in the paper.

Example 1. Suppose the following information table is given


[Table: incomplete information table for Example 1 (objects $a_1$ to $a_{12}$,
attributes $c_1$ to $c_4$, decision $d$)]

where $a_1, \ldots, a_{12}$ are the available objects, $c_1, \ldots, c_4$ are
four attributes whose (discrete) values range from 0 to 3, and $d$ is a decision
attribute classifying objects either to the set $\Phi$ or to the set $\Psi$.

Using the tolerance relation approach to analyse the above example we
have the following results:
$I_C(a_1) = \{a_1, a_{11}, a_{12}\}$,
$I_C(a_2) = \{a_2, a_3\}$,
$I_C(a_3) = \{a_2, a_3\}$,
$I_C(a_4) = \{a_4, a_5, a_{10}, a_{11}, a_{12}\}$,
$I_C(a_5) = \{a_4, a_5, a_{10}, a_{11}, a_{12}\}$,
$I_C(a_6) = \{a_6\}$,
$I_C(a_7) = \{a_7, a_8, a_9, a_{11}, a_{12}\}$,
$I_C(a_8) = \{a_7, a_8, a_{10}\}$,
$I_C(a_9) = \{a_7, a_9, a_{11}, a_{12}\}$,
$I_C(a_{10}) = \{a_4, a_5, a_8, a_{10}, a_{11}\}$,
$I_C(a_{11}) = \{a_1, a_4, a_5, a_7, a_9, a_{10}, a_{11}, a_{12}\}$,
$I_C(a_{12}) = \{a_1, a_4, a_5, a_7, a_9, a_{11}, a_{12}\}$.
From which we can deduce that: $\Phi_C = \emptyset$,
$\Phi^C = \{a_1, a_2, a_3, a_4, a_5, a_7, a_8, a_9, a_{10}, a_{11}, a_{12}\}$,
$\Psi_C = \{a_6\}$, $\Psi^C = A$.

The results are quite poor. Moreover, there exist elements which intuitively
could be classified in $\Phi$ or in $\Psi$, while they are not. Take for
instance $a_1$. We have complete knowledge about it and intuitively there is no
element perceived as similar to it. However, it is not in the lower
approximation of $\Phi$. This is due to the “missing values” of $a_{11}$ and
$a_{12}$, which enable them to be considered as “similar” to $a_1$. Of course
this is “safe”, because potentially the two objects could turn out to have
exactly the same values as $a_1$.
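The tolerance classes and approximations of Example 1 can be recomputed with a
short Python sketch. The information table itself is not legible in this text,
so the values below are an assumption: a hypothetical table chosen so that it
reproduces the tolerance classes and approximations listed above. The function
names (`tolerant`, `tolerance_class`, `lower`, `upper`) are illustrative only.

```python
# Hypothetical reconstruction of the Example 1 table (an assumption, not taken
# from the text). "*" marks an unknown value; columns are c1..c4, then d.
TABLE = {
    "a1":  ("3", "2", "1", "0", "Phi"),
    "a2":  ("2", "3", "2", "0", "Phi"),
    "a3":  ("2", "3", "2", "0", "Psi"),
    "a4":  ("*", "2", "*", "1", "Phi"),
    "a5":  ("*", "2", "*", "1", "Psi"),
    "a6":  ("2", "3", "2", "1", "Psi"),
    "a7":  ("3", "*", "*", "3", "Phi"),
    "a8":  ("*", "0", "0", "*", "Psi"),
    "a9":  ("3", "2", "1", "3", "Psi"),
    "a10": ("1", "*", "*", "*", "Phi"),
    "a11": ("*", "2", "*", "*", "Psi"),
    "a12": ("3", "2", "1", "*", "Phi"),
}
N_ATTRS = 4


def tolerant(x, y, attrs=range(N_ATTRS)):
    """T(x, y): on every attribute the values agree or one of them is unknown."""
    vx, vy = TABLE[x], TABLE[y]
    return all(vx[j] == vy[j] or vx[j] == "*" or vy[j] == "*" for j in attrs)


def tolerance_class(x, attrs=range(N_ATTRS)):
    """I_B(x): all objects y for which T(x, y) holds on the attributes B."""
    return {y for y in TABLE if tolerant(x, y, attrs)}


def members(decision):
    """Objects assigned to the given decision class."""
    return {x for x, row in TABLE.items() if row[-1] == decision}


def lower(target, attrs=range(N_ATTRS)):
    """Objects whose whole tolerance class lies inside the target set."""
    return {x for x in TABLE if tolerance_class(x, attrs) <= target}


def upper(target, attrs=range(N_ATTRS)):
    """Objects whose tolerance class intersects the target set."""
    return {x for x in TABLE if tolerance_class(x, attrs) & target}


phi, psi = members("Phi"), members("Psi")
print(sorted(tolerance_class("a1")))
print(sorted(lower(phi)), sorted(lower(psi)))
```

Passing a subset of attribute indices, for example `attrs=[0, 1, 3]`, gives the
classes for a candidate reduct under the same assumed table.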

A reduct is defined as in the “classical” rough set model, i.e. it is a minimal
subset of attributes that preserves the lower approximations of the object
classification obtained with all attributes. In Example 1, the set of attributes
$\{c_1, c_2, c_4\}$ is the only reduct. Kryszkiewicz [3] discussed the
generation of decision rules from incomplete information tables. She considered
mainly generalized decision rules of the form
$\wedge_i (c_i, v) \rightarrow \vee (d, w)$. If the decision part contains one
disjunct only, the rule is certain. Let $B$ be the set of condition attributes
which occur in the condition part of the rule $s \rightarrow t$. A decision rule
is true if for each object $x$ satisfying the condition part $s$,
$I_B(x) \subseteq [t]$. It is also required that the rule has a non-redundant
condition part. In our example, we can find only one certain decision rule:
$(c_1 = 2) \wedge (c_2 = 3) \wedge (c_4 = 1) \rightarrow (d = \Psi)$.


3  Similarity  Relations 

We now introduce a new approach based on the concept of a not necessarily
symmetric similarity relation. Such a concept was first introduced in general
rough set theory by Słowiński and Vanderpooten [8] in order to enhance the
concept of indiscernibility relation. We first introduce what we call the
“absent values semantics” for incomplete information tables. In this approach we
consider that objects may be described “incompletely” not only because of our
imperfect knowledge, but also because it is definitely impossible to describe
them on all the attributes. Therefore we do not consider the unknown values as
uncertain, but as “non existing”, and we do not allow unknown values to be
compared.

Under such a perspective each object may have a more or less complete
description, depending on how many attributes it has been possible to apply.
From this point of view an object $x$ can be considered similar to another
object $y$ only if they have the same known values. More formally, denoting as
usual the unknown value as $*$, given an information table $IT = (A, C)$ and a
subset of attributes $B \subseteq C$ we introduce a similarity relation $S$ as
follows:

$$\forall x, y:\quad S(x,y) \Leftrightarrow \forall c_j \in B:\ c_j(x) \neq * \Rightarrow c_j(x) = c_j(y)$$

It is easy to observe that such a relation, although not symmetric, is
transitive. The relation $S$ is a partial order on the set $A$. Actually it can
be seen as a representation of the inclusion relation, since we can consider
that “$x$ is similar to $y$” iff “the description of $x$” is included in “the
description of $y$”. We can define for any object $x \in A$ two sets:

$R(x) = \{y \in A \mid S(y, x)\}$  the set of objects similar to $x$

$R^{-1}(x) = \{y \in A \mid S(x, y)\}$  the set of objects to which $x$ is similar

Clearly $R(x)$ and $R^{-1}(x)$ are two different sets. We can now define the
lower and upper approximation of a set $\Phi$ as follows:

$\Phi_B = \{x \in A \mid R^{-1}(x) \subseteq \Phi\}$  the lower approximation of $\Phi$

$\Phi^B = \bigcup\{R(x) \mid x \in \Phi\}$  the upper approximation of $\Phi$

In other terms, we consider as surely belonging to $\Phi$ all objects $x$ such
that every object to which $x$ is similar belongs to $\Phi$. On the other hand,
any object which is similar to an object in $\Phi$ could potentially belong to
$\Phi$. Comparing our approach with the tolerance relation based one we can
state the following result.

Theorem 1. Given an information table $IT = (A, C)$ and a set $\Phi$, the upper
and lower approximations of $\Phi$ obtained using a non symmetric similarity
relation are a refinement of the ones obtained using a tolerance relation.

Proof. Denote as $\Phi_B^T$ the lower approximation of $\Phi$ using the
tolerance approach and $\Phi_B^S$ the lower approximation of $\Phi$ using the
similarity approach, $\Phi_T^B$ and $\Phi_S^B$ being the upper approximations
respectively. We have to demonstrate that $\Phi_B^T \subseteq \Phi_B^S$ and
$\Phi_S^B \subseteq \Phi_T^B$. Clearly we have that
$\forall x, y:\ S(x,y) \rightarrow T(x,y)$, since the conditions for which the
relation $S$ holds are a subset of the conditions for which the relation $T$
holds. Then it is easy to observe that $\forall x:\ R(x) \subseteq I(x)$ and
$R^{-1}(x) \subseteq I(x)$.

1. $\Phi_B^T \subseteq \Phi_B^S$. By definition
   $\Phi_B^T = \{x \in A \mid I(x) \subseteq \Phi\}$ and
   $\Phi_B^S = \{x \in A \mid R^{-1}(x) \subseteq \Phi\}$. Therefore if an
   object $x$ belongs to $\Phi_B^T$ we have that $I(x) \subseteq \Phi$, and
   since $R^{-1}(x) \subseteq I(x)$ we have that $R^{-1}(x) \subseteq \Phi$ and
   therefore the same object $x$ will belong to $\Phi_B^S$. The inverse is not
   always true. Thus the lower approximation of $\Phi$ using the non symmetric
   similarity relation is at least as rich as the lower approximation of $\Phi$
   using the tolerance relation.

2. $\Phi_S^B \subseteq \Phi_T^B$. By definition
   $\Phi_S^B = \bigcup_{x \in \Phi} R(x)$ and
   $\Phi_T^B = \bigcup_{x \in \Phi} I(x)$, and since $R(x) \subseteq I(x)$ the
   union of the sets $R(x)$ will be a subset of the union of the sets $I(x)$.
   The inverse is not always true. Therefore the upper approximation of $\Phi$
   using the non symmetric similarity relation is at most as rich as the upper
   approximation of $\Phi$ using the tolerance relation.

Continuation of Example 1. Let us come back to the example introduced
in section 2. Using all attributes $C$ we have the following results:
$R^{-1}(a_1) = \{a_1\}$, $R(a_1) = \{a_1, a_{11}, a_{12}\}$,
$R^{-1}(a_2) = \{a_2, a_3\}$, $R(a_2) = \{a_2, a_3\}$,
$R^{-1}(a_3) = \{a_2, a_3\}$, $R(a_3) = \{a_2, a_3\}$,
$R^{-1}(a_4) = \{a_4, a_5\}$, $R(a_4) = \{a_4, a_5, a_{11}\}$,
$R^{-1}(a_5) = \{a_4, a_5\}$, $R(a_5) = \{a_4, a_5, a_{11}\}$,
$R^{-1}(a_6) = \{a_6\}$, $R(a_6) = \{a_6\}$,
$R^{-1}(a_7) = \{a_7, a_9\}$, $R(a_7) = \{a_7\}$,
$R^{-1}(a_8) = \{a_8\}$, $R(a_8) = \{a_8\}$,
$R^{-1}(a_9) = \{a_9\}$, $R(a_9) = \{a_7, a_9, a_{11}, a_{12}\}$,
$R^{-1}(a_{10}) = \{a_{10}\}$, $R(a_{10}) = \{a_{10}\}$,
$R^{-1}(a_{11}) = \{a_1, a_4, a_5, a_9, a_{11}, a_{12}\}$, $R(a_{11}) = \{a_{11}\}$,
$R^{-1}(a_{12}) = \{a_1, a_9, a_{12}\}$, $R(a_{12}) = \{a_{11}, a_{12}\}$.
From which we can deduce that: $\Phi_C = \{a_1, a_{10}\}$,
$\Phi^C = \{a_1, a_2, a_3, a_4, a_5, a_7, a_{10}, a_{11}, a_{12}\}$,
$\Psi_C = \{a_6, a_8, a_9\}$,
$\Psi^C = \{a_2, a_3, a_4, a_5, a_6, a_7, a_8, a_9, a_{11}, a_{12}\}$.

The new approximations are more informative than the tolerance based ones.
Moreover, we now find in the lower approximations of the sets $\Phi$ and $\Psi$
some of the objects which intuitively we were expecting to be there. Obviously
such an approach is less “safe” than the tolerance based one, since objects can
be classified as “surely in $\Phi$” although very little is known about them
(e.g. object $a_{10}$). However, under the “absent values” semantics we do not
consider a partially described object as “little known”, but as “known” on just
a few attributes.
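The similarity classes and approximations above can be recomputed in the same
way. As before, the table values are an assumption (a hypothetical table chosen
to be consistent with the classes listed in the text), and the function names
are illustrative.

```python
# Hypothetical Example 1 table (an assumption). "*" is an unknown value;
# columns are c1..c4, followed by the decision d.
TABLE = {
    "a1":  ("3", "2", "1", "0", "Phi"),
    "a2":  ("2", "3", "2", "0", "Phi"),
    "a3":  ("2", "3", "2", "0", "Psi"),
    "a4":  ("*", "2", "*", "1", "Phi"),
    "a5":  ("*", "2", "*", "1", "Psi"),
    "a6":  ("2", "3", "2", "1", "Psi"),
    "a7":  ("3", "*", "*", "3", "Phi"),
    "a8":  ("*", "0", "0", "*", "Psi"),
    "a9":  ("3", "2", "1", "3", "Psi"),
    "a10": ("1", "*", "*", "*", "Phi"),
    "a11": ("*", "2", "*", "*", "Psi"),
    "a12": ("3", "2", "1", "*", "Phi"),
}
N_ATTRS = 4


def similar(x, y):
    """S(x, y): every known value of x is matched by the same value in y."""
    vx, vy = TABLE[x], TABLE[y]
    return all(vx[j] == "*" or vx[j] == vy[j] for j in range(N_ATTRS))


def R(x):
    """Objects similar to x."""
    return {y for y in TABLE if similar(y, x)}


def R_inv(x):
    """Objects to which x is similar."""
    return {y for y in TABLE if similar(x, y)}


def lower(target):
    """Objects x whose class R_inv(x) lies inside the target set."""
    return {x for x in TABLE if R_inv(x) <= target}


def upper(target):
    """Union of the classes R(x) over all x in the target set."""
    return set().union(*(R(x) for x in target))


phi = {x for x, row in TABLE.items() if row[-1] == "Phi"}
psi = {x for x, row in TABLE.items() if row[-1] == "Psi"}
print(sorted(lower(phi)), sorted(lower(psi)))
```

The relation is deliberately asymmetric: `similar("a11", "a1")` holds because
the single known value of `a11` is matched by `a1`, while the converse fails.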

A subset $C' \subseteq C$ is a reduct with respect to a classification if it is
a minimal subset of the attributes $C$ that keeps the same lower approximation
of this classification. We observe that, according to the definition of the
relation, an object that is “totally unknown” (having an unknown value on all
attributes) is not similar to any other object. If we eliminate one or more
attributes in a way that makes an object “totally unknown” on the remaining
attributes, we lose relevant information for the classification. We can
conclude that all such attributes have to be in the reducts. In Example 1 there
is one reduct, $\{c_1, c_2, c_4\}$: it leads to the same classes $R^{-1}(x)$ and
$R(x)$ as using all attributes.

The decision rule is defined as $s \rightarrow t$ (where
$s = \wedge_i (c_i, v)$ and $t = (d, w)$). The rule is true if for each object
$x$ satisfying $s$, its class $R(x) \subseteq [t]$. The condition part cannot
contain redundant conditions.

In Example 1, the following certain decision rules can be generated:

$(c_1 = 1) \rightarrow (d = \Phi)$, $(c_3 = 1) \wedge (c_4 = 0) \rightarrow (d = \Phi)$, $(c_1 = 3) \wedge (c_4 = 0) \rightarrow (d = \Phi)$,

$(c_2 = 3) \wedge (c_4 = 1) \rightarrow (d = \Psi)$, $(c_2 = 0) \rightarrow (d = \Psi)$, $(c_3 = 0) \rightarrow (d = \Psi)$.

The absent value semantics gives more informative decision rules than the
tolerance based approach. Nevertheless, these two approaches (the tolerance and
the non symmetric similarity) appear to be two extremes, between which it could
be possible to use a more flexible approach.

4  Valued  tolerance  relations 

Going back to the example of section 2, let us consider the elements $a_1$,
$a_{11}$ and $a_{12}$. Under both the tolerance relation approach and the non
symmetric similarity relation approach we have: $T(a_{11}, a_1)$,
$T(a_{12}, a_1)$, $S(a_{11}, a_1)$, $S(a_{12}, a_1)$. However, we may wish to
express the intuitive idea that $a_{12}$ is “more similar” to $a_1$ than
$a_{11}$, or that $a_{11}$ is “less similar” to $a_1$ than $a_{12}$. This is due
to the fact that in the case of $a_{12}$ only one value is unknown and the rest
are all equal, while in the case of $a_{11}$ only one value is equal and the
rest are unknown. We may try to capture such a difference using a valued
tolerance relation.

The reader may notice that we can define different types of valued tolerance
(or similarity) using different comparison rules. Moreover, a valued tolerance
(or similarity) relation can also be defined for complete information tables.
Actually the approach we will present is independent of the specific formula
adopted for the valued tolerance and can be extended to any type of valued
relation.

Given a valued tolerance relation, for each element of $A$ we can define a
“tolerance class” that is a fuzzy set whose membership function is the
“tolerance degree” to the reference object. It is easy to observe that if we
associate the value 1 with every non zero tolerance degree we obtain the
tolerance classes introduced in section 2. The problem is to define the concepts
of upper and lower approximation of a set $\Phi$. Given a set $\Phi$ to describe
and a set $Z \subseteq A$ we will try to define the degree by which $Z$
approximates the set $\Phi$ from the top or from the bottom. Under such a
perspective, each subset of $A$ may be a lower or upper approximation of $\Phi$,
but to different degrees. For this purpose we need to translate into a
functional representation the usual logical connectives of negation,
conjunction, etc.:

1. A negation is a function $N : [0,1] \to [0,1]$ such that $N(0) = 1$ and
$N(1) = 0$. A usual representation of the negation is $N(x) = 1 - x$.

2. A T-norm is a continuous, non decreasing function $T : [0,1]^2 \to [0,1]$
such that $T(x, 1) = x$. Clearly a T-norm stands for a conjunction. Usual
representations of T-norms are: the min: $T(x,y) = \min(x,y)$; the product:
$T(x,y) = xy$; the Łukasiewicz T-norm: $T(x,y) = \max(x + y - 1, 0)$.

3. A T-conorm is a continuous, non decreasing function $S : [0,1]^2 \to [0,1]$
such that $S(0, y) = y$. Clearly a T-conorm stands for a disjunction. Usual
representations of T-conorms are: the max: $S(x,y) = \max(x,y)$; the product:
$S(x,y) = x + y - xy$; the Łukasiewicz T-conorm: $S(x,y) = \min(x + y, 1)$.

If $S(x,y) = N(T(N(x), N(y)))$ we have the equivalent of the De Morgan law
and we call the triplet $(N, T, S)$ a De Morgan triplet. $I(x, y)$, the degree
by which $x$ may imply $y$, is again a function $I : [0,1]^2 \to [0,1]$.
However, there is no unanimity on the properties that such a function should
satisfy. Two basic properties may be desired: the first claiming that
$I(x,y) = S(N(x), y)$, translating the usual logical equivalence
$x \rightarrow y =_{def} \neg x \vee y$; the second claiming that whenever the
truth value of $x$ is not greater than the truth value of $y$, then the
implication should be true ($x \leq y \Rightarrow I(x,y) = 1$). It is almost
impossible to satisfy both properties. In the very few cases where this happens
other properties are not satisfied (for a discussion see [1]).
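The connectives listed above are easy to experiment with numerically. The
following Python sketch encodes the three T-norm and T-conorm pairs, builds the
De Morgan dual of a T-norm with the negation $N(x) = 1 - x$, and derives the
implication $I(x,y) = S(N(x), y)$ of the first property. The dictionary keys and
function names are illustrative choices, not notation from the paper.

```python
def neg(x):
    """Standard negation N(x) = 1 - x."""
    return 1.0 - x


T_NORMS = {
    "min": lambda x, y: min(x, y),
    "product": lambda x, y: x * y,
    "lukasiewicz": lambda x, y: max(x + y - 1.0, 0.0),
}

T_CONORMS = {
    "max": lambda x, y: max(x, y),
    "prob_sum": lambda x, y: x + y - x * y,  # called "the product" in the text
    "lukasiewicz": lambda x, y: min(x + y, 1.0),
}


def de_morgan_dual(t_norm):
    """T-conorm obtained from a T-norm via S(x, y) = N(T(N(x), N(y)))."""
    return lambda x, y: neg(t_norm(neg(x), neg(y)))


def s_implication(t_conorm):
    """Implication of the first property: I(x, y) = S(N(x), y)."""
    return lambda x, y: t_conorm(neg(x), y)


# With the product pair the implication is I(x, y) = 1 - x + x*y,
# which does not return 1 for x = y = 0.5 (second property fails).
imp = s_implication(T_CONORMS["prob_sum"])
print(imp(0.5, 0.5))
```

On a grid of test values each T-norm's De Morgan dual coincides with the
T-conorm paired with it in the list above.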

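Coming back to our lower and upper approximations, we know that given a
set $Z \subseteq A$, a set $\Phi$ and attributes $B \subseteq C$ the usual
definitions are:

1. $Z = \Phi_B \Leftrightarrow \forall z \in Z,\ \Theta(z) \subseteq \Phi$;  2. $Z = \Phi^B \Leftrightarrow \forall z \in Z,\ \Theta(z) \cap \Phi \neq \emptyset$

$\Theta(z)$ being the “indiscernibility (tolerance, similarity etc.)” class of
element $z$. The functional translation of such definitions is straightforward.
Having:

$\forall x\, \phi(x) =_{def} T_x \phi(x)$;  $\exists x\, \phi(x) =_{def} S_x \phi(x)$;  $\Phi \subseteq \Psi =_{def} \forall x\, \phi(x) \rightarrow \psi(x) =_{def} T_x(I(\mu_\Phi(x), \mu_\Psi(x)))$;

$\Phi \cap \Psi \neq \emptyset =_{def} \exists x\, \phi(x) \wedge \psi(x) =_{def} S_x(T(\mu_\Phi(x), \mu_\Psi(x)))$ we get: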

1. $\mu_{\Phi_B}(Z) = T_{z \in Z}(T_{x \in \Theta(z)}(I(R(z,x), \hat{x})))$,

2. $\mu_{\Phi^B}(Z) = T_{z \in Z}(S_{x \in \Theta(z)}(T(R(z,x), \hat{x})))$,

where: $\mu_{\Phi_B}(Z)$ is the degree for set $Z$ to be a lower approximation
of $\Phi$; $\mu_{\Phi^B}(Z)$ is the degree for set $Z$ to be an upper
approximation of $\Phi$; $\Theta(z)$ is the tolerance class of element $z$;
$T$, $S$, $I$ are the functions previously defined; $R(z, x)$ is the membership
degree of element $x$ in the tolerance class of $z$; $\hat{x}$ is the membership
degree of element $x$ in the set $\Phi$ ($\hat{x} \in \{0, 1\}$).

Continuation of Example 1. Considering that the set of possible values on each
attribute is discrete, we make the hypothesis that there exists a uniform
probability distribution among such values. More formally, consider an
attribute $c_j$ of an information table $IT = (A, C)$ and associate with it the
set $E_j = \{e_j^1, \ldots, e_j^m\}$ of all its possible values. Given an
element $x \in A$, the probability that $c_j(x) = e_j^i$ is $1/|E_j|$.
Therefore, given any two elements $x, y \in A$ and an attribute $c_j$, if
$c_j(y) = e_j^i$ and $c_j(x)$ is unknown, the probability $R_j(x,y)$ that $x$ is
similar to $y$ on the attribute $c_j$ is $1/|E_j|$. On this basis we can compute
the probability that two elements are similar on the whole set of attributes as
the joint probability that the values of the two elements are the same on all
the attributes: $R(x, y) = \prod_{c_j \in C} R_j(x, y)$. Applying this rule to
the objects of Example 1 we obtain Table 1, concerning the valued tolerance
relation.

Table 1: Valued tolerance relation for Example 1.
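The product rule for the valued tolerance can be sketched in a few lines of
Python. Several points here are assumptions rather than statements from the
paper: the table values are a hypothetical reconstruction consistent with the
classes of Example 1, a pair of two unknown values is also given degree
$1/|E_j|$, and every object is given degree 1 with itself. Exact fractions are
used so that the degrees can be read directly.

```python
from fractions import Fraction

# Hypothetical Example 1 table, condition attributes c1..c4 only (assumption).
TABLE = {
    "a1":  ("3", "2", "1", "0"),
    "a2":  ("2", "3", "2", "0"),
    "a3":  ("2", "3", "2", "0"),
    "a4":  ("*", "2", "*", "1"),
    "a5":  ("*", "2", "*", "1"),
    "a6":  ("2", "3", "2", "1"),
    "a7":  ("3", "*", "*", "3"),
    "a8":  ("*", "0", "0", "*"),
    "a9":  ("3", "2", "1", "3"),
    "a10": ("1", "*", "*", "*"),
    "a11": ("*", "2", "*", "*"),
    "a12": ("3", "2", "1", "*"),
}
DOMAIN_SIZE = 4  # each attribute takes the values 0..3


def attribute_degree(vx, vy):
    """R_j(x, y): 1 or 0 for known values, 1/|E_j| if a value is unknown."""
    if vx == "*" or vy == "*":
        return Fraction(1, DOMAIN_SIZE)
    return Fraction(1) if vx == vy else Fraction(0)


def valued_tolerance(x, y):
    """R(x, y): product of the per-attribute degrees (1 on the diagonal)."""
    if x == y:
        return Fraction(1)  # assumption: the relation is kept reflexive
    degree = Fraction(1)
    for vx, vy in zip(TABLE[x], TABLE[y]):
        degree *= attribute_degree(vx, vy)
    return degree


row_a1 = [valued_tolerance("a1", y) for y in TABLE]
print([str(d) for d in row_a1])
```

Under these assumptions the row for $a_1$ has degree 1/64 for $a_{11}$ (three
unknown values) and 1/4 for $a_{12}$ (one unknown value).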


If we consider element $a_1$, the valued tolerance relation $R(a_1, x)$,
$x \in A$, results in the vector $[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1/64, 1/4]$,
which actually represents the tolerance class $\Theta(a_1)$ of element $a_1$.
The reader may notice that the crisp tolerance class of element $a_1$ was the
set $\{a_1, a_{11}, a_{12}\}$, which corresponds to the vector
$[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1]$. Following our “probabilistic approach”
we may choose for $T$ and $S$ the product representation, while for $I$ we will
satisfy the De Morgan property, thus obtaining: $T(x,y) = xy$,
$S(x,y) = x + y - xy$, $I(x,y) = 1 - x + xy$. Clearly our choice of $I(x,y)$
does not satisfy the second property of implication. However, the reader may
notice that in our specific case we have a peculiar implication from a fuzzy set
($\Theta(z)$) to a regular set ($\Phi$), such that $\hat{x} \in \{0, 1\}$. The
application of any implication satisfying the second property will reduce the
valuation to the set $\{0, 1\}$ and therefore the whole degree
$\mu_{\Phi_B}(Z)$ will collapse to $\{0, 1\}$ and thus to the usual lower
approximation. With such considerations we obtain:

$\mu_{\Phi_B}(Z) = \prod_{z \in Z} \prod_{x \in \Theta(z)} (1 - R(z,x) + R(z,x)\hat{x})$

$\mu_{\Phi^B}(Z) = \prod_{z \in Z} \left(1 - \prod_{x \in \Theta(z)} (1 - R(z,x)\hat{x})\right)$

Consider now the set $\Phi$ and as set $Z$ consider the element $a_1$, where
$R(a_1, x)$ was previously introduced and $\hat{x}$ takes the values
$[1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1]$. We obtain $\mu_{\Phi_B}(Z) = 0.98$ and
$\mu_{\Phi^B}(Z) = 1$. Operationally we could choose a set $Z$ as lower (upper)
approximation of set $\Phi$ as follows:

1. take all elements for which $\mu(\Theta(z) \rightarrow \Phi) = 1$
   ($\mu(\Theta(z) \cap \Phi) = 1$);

2. then add elements in a way such that $\mu(\Theta(z) \rightarrow \Phi) > k$
   ($\mu(\Theta(z) \cap \Phi) > k$), for decreasing values of $k$ (let us say
   0.99, 0.98 etc.), thus obtaining a family of lower (upper) approximations
   with decreasing membership function $\mu_{\Phi_B}(Z)$ ($\mu_{\Phi^B}(Z)$);

3. fix a minimum level $\lambda$ enabling to accept a set $Z$ as a lower (upper)
   approximation of $\Phi$ (thus $\mu_{\Phi_B}(Z) \geq \lambda$).
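The two degrees quoted for $Z = \{a_1\}$ can be checked directly from the
vectors given in the text. The sketch below uses the product T-norm and the
implication $I(x,y) = 1 - x + xy$; the function names are illustrative. Objects
with tolerance degree 0 contribute a factor of 1, so iterating over all objects
gives the same result as iterating over the tolerance class only.

```python
from fractions import Fraction
from math import prod

# R(a1, x) for x = a1..a12 and the membership of each x in Phi, as in the text.
R_A1 = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, Fraction(1, 64), Fraction(1, 4)]
IN_PHI = [1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1]


def lower_degree(tolerance_rows, membership):
    """Degree to which Z is a lower approximation (one tolerance row per z)."""
    return prod(
        1 - r + r * m
        for row in tolerance_rows
        for r, m in zip(row, membership)
    )


def upper_degree(tolerance_rows, membership):
    """Degree to which Z is an upper approximation (one tolerance row per z)."""
    return prod(
        1 - prod(1 - r * m for r, m in zip(row, membership))
        for row in tolerance_rows
    )


low = lower_degree([R_A1], IN_PHI)
up = upper_degree([R_A1], IN_PHI)
print(float(low), float(up))  # 63/64 (about 0.98) and 1
```

A candidate set $Z$ would then be accepted as a lower approximation when
`low >= lam` for a chosen minimum level `lam`.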

The concepts of reduct and decision rule are also generalized in the valued
tolerance case. Given the decision table $(A, C)$ and the partition
$Y = \Phi_1, \Phi_2, \ldots, \Phi_n$, the subset of attributes $C' \subset C$ is
a reduct iff it does not decrease the degree of lower approximation obtained
with $C$, i.e. if $Z_1, Z_2, \ldots, Z_n$ is a family of lower approximations of
$\Phi_1, \Phi_2, \ldots, \Phi_n$ then
$\forall i = 1, \ldots, n:\ \mu_{\Phi_{i,C'}}(Z_i) \geq \mu_{\Phi_{i,C}}(Z_i)$.

In order to induce classification rules from the decision table at hand we may
now accept rules with a “credibility degree”, derived from the fact that objects
may be similar to the condition part of the rule only to a certain degree,
besides the fact that the implication in the decision part is also uncertain.
More formally we give the following representation for a rule $\rho_i$:
$\rho_i =_{def} \bigwedge_{c_j \in J} (c_j = v) \rightarrow (d = w)$

where: $J \subseteq C$, $v$ is the value of attribute $c_j$, $w$ is the value of
attribute $d$.

As usual we may use the relation $s(x, \rho_i)$ in order to indicate that
element $x$ “supports” rule $\rho_i$, or that $x$ is similar to some extent to
the condition part of rule $\rho_i$. We denote
$S(\rho_i) = \{x : s(x, \rho_i) > 0\}$ and $W = \{x : d(x) = w\}$. Then $\rho_i$
is a decision rule iff $\forall x \in S(\rho_i):\ \Theta(x) \subseteq W$. We can
compute a credibility degree for any rule $\rho_i$ by calculating the truth
value of the previous formula, which can be rewritten as
$\forall x, y:\ s(x, \rho_i) \rightarrow (R(x,y) \rightarrow W(y))$. We get:
$\mu(\rho_i) = T_x(I(s(x, \rho_i), T_y(I(\mu_{\Theta(x)}(y), \mu_W(y)))))$.
Finally it is necessary to check whether $J$ is a non-redundant set of
conditions for rule $\rho_i$, i.e. to check whether it is possible to satisfy
the condition:
$\exists \hat{J} \subset J:\ \mu(\rho_i^{\hat{J}}) \geq \mu(\rho_i^{J})$

Continuation of Example 1. Consider again the incomplete table and take as
candidate the rule
$\rho_1: (c_1 = 3) \wedge (c_2 = 2) \wedge (c_3 = 1) \wedge (c_4 = 0) \rightarrow (d = \Phi)$.
Since in the paper we have chosen for the functional representation of
implication the satisfaction of the De Morgan law and for T-norms the product,
we get:

$\mu(\rho_i) = \prod_{x \in S(\rho_i)} \left(1 - s(x, \rho_i) + s(x, \rho_i) \prod_{y \in \Theta(x)} (1 - \mu_{\Theta(x)}(y) + \mu_{\Theta(x)}(y)\mu_W(y))\right)$

where $s(x, \rho_i)$ represents the “support” degree of element $x$ to the rule
$\rho_i$. We thus get that $\mu(\rho_1) = 0.905$. However, the condition part of
rule $\rho_1$ is redundant and is transformed to:
$\rho_1: (c_1 = 3) \wedge (c_3 = 1) \wedge (c_4 = 0) \rightarrow (d = \Phi)$
with degree $\mu(\rho_1) = 0.905$. This rule is supported by the objects
$S(\rho_1) = \{a_1, a_{11}, a_{12}\}$. For the set $\Psi$ we have one rule:
$\rho_2: (c_1 = 2) \wedge (c_2 = 3) \wedge (c_4 = 1) \rightarrow (d = \Psi)$
with degree $\mu(\rho_2) = 1.0$ and a supporting object $a_6$.
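The credibility degree is a nested product of implications, which the following
Python sketch makes explicit for the product T-norm and $I(x,y) = 1 - x + xy$.
It is only an illustration of the formula's structure: the three-object data
(support degrees, tolerance rows and membership in $W$) are invented toy values,
not the data of Example 1, and the function names are hypothetical.

```python
from fractions import Fraction
from math import prod


def implication(x, y):
    """I(x, y) = 1 - x + x*y, the implication paired with the product T-norm."""
    return 1 - x + x * y


def credibility(support, tolerance, in_w):
    """Product over x of I(s(x), product over y of I(R(x, y), W(y)))."""
    return prod(
        implication(s_x, prod(implication(r, w) for r, w in zip(row, in_w)))
        for s_x, row in zip(support, tolerance)
    )


half = Fraction(1, 2)
support = [1, half, 0]      # toy s(x, rho) for three objects
tolerance = [               # toy valued tolerance R(x, y)
    [1, half, 0],
    [half, 1, 0],
    [0, 0, 1],
]
in_w = [1, 0, 1]            # toy membership of each object in W

print(credibility(support, tolerance, in_w))
```

In this toy case the second object supports the rule with degree 1/2 but lies
outside $W$, which is what pulls the credibility below 1.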

Operationally a user may first fix a credibility threshold for the rules to
accept and could then perform a sensitivity analysis on the set of rules that
it is possible to accept within an interval around such a threshold.

5  Conclusions

Rough set theory has been conceived under the implicit hypothesis that all
objects in a universe can be evaluated under a given set of attributes. However,
it can be the case that several values are not available for various reasons. In
our paper we introduce two different semantics in order to distinguish such
situations. “Missing values” imply that unavailable information could always
become available and that, in order to make “safe” classifications and rule
induction, we might consider that such missing values are equal to everything.
Tolerance relations (which are reflexive and symmetric, but not transitive)
capture such an approach in a formal way. “Absent values” imply that unavailable
information cannot be used in comparing objects and that classification and
rule induction should be performed with the existing information, since the
absent values could never become available. Similarity relations (which in our
case are reflexive and transitive, but not symmetric) are introduced in our
paper in order to formalize such an idea. We demonstrate in the paper that our
approach always leads to more informative results with respect to the tolerance
relation based approach (although less safe).

A third approach is also introduced in the paper, as an intermediate position
between the two previously presented. Such an approach is based on the use of a
valued tolerance relation. A valued relation could appear for several reasons,
not only because of unavailable information, and in fact the approach presented
has a more general validity. However, in this paper we limit ourselves to
discussing the missing values case. A functional extension of the concepts of
upper and lower approximation is introduced in this paper, so that a degree of
lower (upper) approximation can be associated with any subset of the universe.
Furthermore, such a functional extension makes it possible to compute a
credibility degree for any decision rule induced by the classification.

Abstract. Another formulation of the notion of rough relations is presented.
Instead of using two equivalence relations on two universes, or a joint
equivalence relation on their Cartesian product, we start from specific classes
of binary relations obeying certain properties. The chosen class of relations
is a subsystem of all binary relations and represents the relations we are
interested in. An arbitrary relation is approximated by a pair of relations in
the chosen class.


1  Introduction 

The theory of rough sets is built on partitions of the universe defined by
equivalence relations [6, 16]. A partition of the universe represents a
granulated view of the universe, in which equivalence classes are considered to
be basic granules. It is assumed that information is available only for the
basic granules. One has to consider each equivalence class as a whole instead
of individual elements of the universe. For inferring information about an
arbitrary subset of the universe, it is necessary to consider its
approximations by equivalence classes. More specifically, a set is described by
a pair of lower and upper approximations. From existing studies of rough sets,
we can identify at least two formulations, the partition based method and the
subsystem based method [14, 15]. In the partition based approach, the lower
approximation is the union of equivalence classes contained in the set, and the
upper approximation is the union of equivalence classes having a nonempty
intersection with the set. In the subsystem based approach, one can use
equivalence classes as basic building blocks and construct a subsystem of the
power set by taking unions of equivalence classes. The constructed subsystem is
in fact a σ-algebra of subsets of the universe. That is, it contains both the
empty set and the entire set, and is closed under set intersection and union.
The lower approximation is the largest subset in the subsystem that is
contained in the set to be approximated, and the upper approximation is the
smallest subset in the subsystem that contains the set to be approximated. Each
of the two formulations captures different aspects of rough set approximations.
They can be used to obtain quite distinctive generalizations of rough set
theory [15, 17].
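The subsystem based formulation described above can be illustrated with a small
Python sketch. The universe, the partition and the helper names are toy
assumptions introduced only for this example. The subsystem is built as the
family of all unions of equivalence classes, and the approximations are read off
as the largest member contained in the set and the smallest member containing
it.

```python
from itertools import combinations


def definable_sets(partition):
    """All unions of equivalence classes, including the empty union."""
    blocks = [frozenset(b) for b in partition]
    family = set()
    for k in range(len(blocks) + 1):
        for chosen in combinations(blocks, k):
            family.add(frozenset().union(*chosen))
    return family


def lower_subsystem(family, x):
    """Largest member of the subsystem contained in x."""
    return max((d for d in family if d <= x), key=len)


def upper_subsystem(family, x):
    """Smallest member of the subsystem containing x."""
    return min((d for d in family if d >= x), key=len)


# Toy universe {1, ..., 6} with three equivalence classes (an assumption).
partition = [{1, 2}, {3}, {4, 5, 6}]
family = definable_sets(partition)
target = frozenset({1, 2, 3, 4})

print(sorted(lower_subsystem(family, target)))
print(sorted(upper_subsystem(family, target)))
```

Because the family is closed under union and intersection, the largest contained
member and the smallest containing member are unique, so selecting by size is
safe here.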

A binary relation is a set of pairs, i.e., a subset of the Cartesian product of
two universes. It is therefore very natural to generalize rough sets to the
notion of rough relations. The majority of existing studies on rough relations
rely on the partition based approach. It involves two equivalence relations on
two universes, or a joint equivalence relation on their Cartesian product. This
straightforward definition of rough relations was proposed by Pawlak [7, 9].
Generalizations of rough relations, along the same line of argument, have been
made by Düntsch [3], Stepaniuk [11, 12, 13], and Skowron and Stepaniuk [10]. An
implication of the partition based formulation is that the properties of lower
and upper approximations depend on the relation to be approximated. Although a
binary relation is a set of pairs, it is a set equipped with additional
properties, such as reflexivity, symmetry, and transitivity. The added
information provided by binary relations is not fully explored in many studies
of rough relations. For some applications, we may only be interested in
approximating a relation in terms of relations with special properties [4]. The
subsystem based approach may be useful, as one can choose the subsystem so that
all relations in the subsystem have some desired properties. Greco et al. [4]
implicitly used the subsystem based approach for the approximation of
preferential information.

The  main  objective  of  this  paper  is  to  present  an  alternative  formulation 
of  rough  relations  by  extending  the  subsystem  based  method.  In  Section  2,  we 
review  two  formulations  of  rough  set  approximations.  In  Section  3,  a  subsystem 
based  formulation  of  rough  relations  is  introduced.  Special  types  of  subsystems 
are  used  for  defining  rough  relation  approximations.  This  study  is  complemen¬ 
tary  to  existing  studies,  and  the  results  may  provide  more  insights  into  the 
understanding  and  applications  of  rough  relations. 

2  Two  Formulations  of  Rough  Set  Approximations 

Let $E \subseteq U \times U$ denote an equivalence relation on a finite and nonempty universe
$U$, where $U \times U = U^2$ is the Cartesian product of $U$ with itself. That is, $E$ is reflexive,
symmetric, and transitive. The pair $apr = (U, E)$ is referred to as a Pawlak approximation
space. The equivalence relation $E$ partitions $U$ into disjoint subsets
known as equivalence classes. That is, $E$ induces a quotient set of the universe
$U$, denoted by $U/E$. Equivalence classes are called elementary sets. They are
interpreted as basic observable, measurable, or definable subsets of $U$. The empty
set $\emptyset$ and a union of one or more elementary sets are interpreted as composite
ones. The family of all such subsets is denoted by $\mathrm{Def}(U)$. It defines a topological
space $(U, \mathrm{Def}(U))$ in which $\mathrm{Def}(U)$, a subsystem of the power set of $U$, consists
of sets that are both closed and open. Two formulations of rough sets can be obtained
by focusing on the partition $U/E$ and the topology $\mathrm{Def}(U)$, respectively.

An arbitrary subset $X \subseteq U$ is approximated by a pair of subsets of $U$ called
lower and upper approximations, or simply a rough set approximation [6]. The
lower approximation $\underline{apr}(X)$ is the union of all elementary sets contained in $X$,
and the upper approximation $\overline{apr}(X)$ is the union of all elementary sets which
have a nonempty intersection with $X$. They are given by:

(def1) $\underline{apr}(X) = \bigcup \{[x]_E \mid x \in U, \; [x]_E \subseteq X\}$,
       $\overline{apr}(X) = \bigcup \{[x]_E \mid x \in U, \; [x]_E \cap X \neq \emptyset\}$,

where $[x]_E$ denotes the equivalence class containing $x$:

$[x]_E = \{y \mid xEy, \; x, y \in U\}$. (1)
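
As a minimal sketch of definition (def1), assume the equivalence relation is supplied directly as its partition $U/E$. The universe, the partition, and the function names below are illustrative choices, not part of the original formulation.

```python
def lower_approx(X, partition):
    """Union of all equivalence classes contained in X (def1, lower)."""
    return frozenset().union(*(block for block in partition if block <= X))


def upper_approx(X, partition):
    """Union of all equivalence classes that intersect X (def1, upper)."""
    return frozenset().union(*(block for block in partition if block & X))


# Hypothetical universe U = {1, ..., 6} with U/E = {{1, 2}, {3, 4}, {5, 6}}.
partition = [frozenset({1, 2}), frozenset({3, 4}), frozenset({5, 6})]
X = frozenset({1, 2, 3})

print(sorted(lower_approx(X, partition)))  # [1, 2]
print(sorted(upper_approx(X, partition)))  # [1, 2, 3, 4]
```

The set $X$ is not definable here, so the two approximations differ: the class $\{3, 4\}$ meets $X$ without being contained in it.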

For  rough  set  approximations,  we  have  the  following  properties: 

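(L1) $\underline{apr}(X \cap Y) = \underline{apr}(X) \cap \underline{apr}(Y)$,

(L2) $\underline{apr}(X) \subseteq X$,

(L3) $\underline{apr}(X) = \underline{apr}(\underline{apr}(X))$,

(L4) $\underline{apr}(X) = \overline{apr}(\underline{apr}(X))$,

and

(U1) $\overline{apr}(X \cup Y) = \overline{apr}(X) \cup \overline{apr}(Y)$,

(U2) $X \subseteq \overline{apr}(X)$,

(U3) $\overline{apr}(X) = \overline{apr}(\overline{apr}(X))$,

(U4) $\overline{apr}(X) = \underline{apr}(\overline{apr}(X))$.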

The two approximations are dual to each other in the sense that $\underline{apr}(-X) = -\overline{apr}(X)$
and $\overline{apr}(-X) = -\underline{apr}(X)$, where $-X$ denotes the complement of $X$ in $U$. The properties
with the same number may be considered as dual properties. It is possible to compute
the lower approximation of $X \cap Y$ based on the lower approximations of $X$ and $Y$. However, it
is impossible to compute the upper approximation of $X \cap Y$ based on the upper
approximations of $X$ and $Y$. A similar observation can also be made for the
approximations of $X \cup Y$.
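
The asymmetry can be checked on a small example. The sketch below assumes a hypothetical universe $\{1, 2, 3, 4\}$ partitioned into $\{1, 2\}$ and $\{3, 4\}$; the sets and function names are illustrative only.

```python
def lower_approx(X, partition):
    """Union of all equivalence classes contained in X."""
    return frozenset().union(*(b for b in partition if b <= X))


def upper_approx(X, partition):
    """Union of all equivalence classes that intersect X."""
    return frozenset().union(*(b for b in partition if b & X))


partition = [frozenset({1, 2}), frozenset({3, 4})]
X, Y = frozenset({1}), frozenset({2})

# Upper approximation of the intersection versus intersection of the upper approximations.
upper_of_meet = upper_approx(X & Y, partition)
meet_of_uppers = upper_approx(X, partition) & upper_approx(Y, partition)

# Lower approximation of the union versus union of the lower approximations.
lower_of_join = lower_approx(X | Y, partition)
join_of_lowers = lower_approx(X, partition) | lower_approx(Y, partition)

print(sorted(upper_of_meet), sorted(meet_of_uppers))  # [] [1, 2]
print(sorted(lower_of_join), sorted(join_of_lowers))  # [1, 2] []
```

Replacing $Y$ by $\{1\}$ leaves both upper approximations unchanged but makes the upper approximation of the intersection equal to $\{1, 2\}$, so the upper approximations of $X$ and $Y$ alone do not determine the result.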

By the properties of rough set approximations, $\underline{apr}(X)$ is indeed the greatest
definable set contained in $X$, and $\overline{apr}(X)$ is the least definable set containing $X$. The
following equivalent definition can be used [6, 14]:

(def2) $\underline{apr}(X) = \bigcup \{Y \mid Y \in \mathrm{Def}(U), \; Y \subseteq X\}$,
       $\overline{apr}(X) = \bigcap \{Y \mid Y \in \mathrm{Def}(U), \; X \subseteq Y\}$.
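
Definition (def2) can be sketched by enumerating $\mathrm{Def}(U)$ explicitly. This is only practical for very small universes, and the helper names and the example partition are assumptions made for illustration.

```python
from itertools import combinations


def definable_sets(partition):
    """Def(U): the empty set together with every union of elementary sets."""
    blocks = list(partition)
    family = set()
    for k in range(len(blocks) + 1):
        for chosen in combinations(blocks, k):
            family.add(frozenset().union(*chosen))
    return family


def lower_def2(X, family):
    """Union of all members of the family contained in X."""
    return frozenset().union(*(Y for Y in family if Y <= X))


def upper_def2(X, family, universe):
    """Intersection of all members of the family containing X."""
    result = frozenset(universe)  # U itself always belongs to Def(U)
    for Y in family:
        if X <= Y:
            result &= Y
    return result


universe = frozenset(range(1, 7))
partition = [frozenset({1, 2}), frozenset({3, 4}), frozenset({5, 6})]
family = definable_sets(partition)
X = frozenset({1, 2, 3})

print(len(family))                              # 8
print(sorted(lower_def2(X, family)))            # [1, 2]
print(sorted(upper_def2(X, family, universe)))  # [1, 2, 3, 4]
```

On this example the result agrees with what (def1) gives for the same partition and the same $X$.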

For a subset $X \in \mathrm{Def}(U)$, we have $X = \underline{apr}(X) = \overline{apr}(X)$. Thus, we can say
that subsets in $\mathrm{Def}(U)$ have exact representations. For other subsets of $U$,
neither the lower nor the upper approximation equals the set itself, which leads to
approximate representations of the set. This is the reason
for calling elements of $\mathrm{Def}(U)$ definable sets. Mathematically speaking, subsets
in $\mathrm{Def}(U)$ may be considered as fixed points of the approximation operators $\underline{apr}$
and $\overline{apr}$. Every other element is approximated using the fixed points. That is,
$\underline{apr}(X)$ is the best approximation of $X$ from below, and $\overline{apr}(X)$ is the best
approximation of $X$ from above.

Although both definitions are equivalent, they offer quite different interpretations
for rough set approximations. Definition (def1) focuses on equivalence
classes, which clearly shows how relationships between elements of $U$ are used.
The approximations of an arbitrary subset of the universe stem from the granulation
of the universe by an equivalence relation. This definition can be extended
to define approximation operators based on other types of binary relations [17].
Definition (def2) focuses on a subsystem of the power set of $U$ with special properties. With fewer
elements in the subsystem than in the power set, certain elements of the
power set have to be approximated. The formulation can be easily applied to
situations where a binary relation is not readily available. It has been used to
study approximation in mathematical structures such as topological spaces, closure
systems, Boolean algebras, lattices, and posets [1, 14, 15].

In generalizing definition (def2), subsystems of the power set must be properly
chosen [15]. The subsystem for defining lower approximations must contain
the empty set $\emptyset$ and be closed under union, and the subsystem for defining upper
approximations must contain the entire set $U$ and be closed under intersection.
In other words, the subsystem for defining upper approximation must be a closure
system [2]. In general, the two subsystems are not necessarily the same,
nor dual to each other [1, 15]. The subsystem $\mathrm{Def}(U)$ induced by an equivalence
relation is only a special case.


3  Rough  Relation  Approximations 

This  section  first  reviews  a  commonly  used  formulation  of  rough  relations  based 
on  definition  (defl)  and  discusses  its  limitations.  By  extending  definition  (def2), 
we  present  a  new  formulation. 


3.1  A  commonly  used  formulation 

A binary relation $R$ on a universe $U$ is a set of ordered pairs of elements from $U$,
i.e., $R \subseteq U \times U$. The power set of $U \times U$, i.e., $2^{U \times U}$, is the set of all binary relations
on $U$. The empty binary relation is denoted by $\emptyset$, and the whole relation is $U \times U$.
One may apply set-theoretic operations to relations and define the complement,
intersection, and union of binary relations. By taking $U \times U$ as a new universe,
one can immediately study approximations of binary relations. For clarity, we
only consider binary relations on the same universe, instead of the general case
where relations are defined on more than two distinct universes [7].

Suppose $E_1$ and $E_2$ are two equivalence relations on $U$. They induce two
approximation spaces $apr_1 = (U, E_1)$ and $apr_2 = (U, E_2)$. The product relation
$E = E_1 \times E_2$:

$(x, y) E (v, w) \iff x E_1 v, \; y E_2 w$, (2)

is an equivalence relation on $U \times U$. It gives rise to a product approximation
space $apr = (U \times U, E_1 \times E_2)$. In the special case, a single approximation space
$apr_U = (U, E_U)$ can be used to derive the product approximation space $apr =
(U \times U, E_U \times E_U)$. The notion of product approximation space forms a basis for
rough relation approximations. For an equivalence relation $E \subseteq (U \times U)^2$, the
equivalence class containing $(x, y)$:

$[(x, y)]_E = \{(v, w) \mid (x, y) E (v, w), \; (x, y), (v, w) \in U \times U\} = [x]_{E_1} \times [y]_{E_2}$, (3)

is in fact a binary relation on $U$. It is called an elementary definable relation. The
empty relation $\emptyset$ and unions of elementary definable relations are referred to as
definable relations. The family of definable relations is denoted by $\mathrm{Def}(U \times U)$.
Although definable relations are constructed from an equivalence relation $E$
on $U \times U$, relations in $\mathrm{Def}(U \times U)$ are not necessarily reflexive, symmetric, or
transitive. This can be easily seen from the fact that the elementary relations
$[(x, y)]_E$ do not necessarily have any of those properties.
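
A short sketch can confirm equation (3) on a concrete case. It assumes that the two equivalence relations are given as partitions of a three-element universe; all names and data are hypothetical.

```python
from itertools import product


def block_of(x, partition):
    """Equivalence class of x under the partition."""
    return next(b for b in partition if x in b)


def class_by_definition(pair, universe, part1, part2):
    """[(x, y)]_E read off directly from equation (2)."""
    x, y = pair
    return frozenset(
        (v, w)
        for v, w in product(universe, repeat=2)
        if block_of(v, part1) == block_of(x, part1)
        and block_of(w, part2) == block_of(y, part2)
    )


def class_as_product(pair, part1, part2):
    """[x]_{E_1} x [y]_{E_2}, the right-hand side of equation (3)."""
    x, y = pair
    return frozenset(product(block_of(x, part1), block_of(y, part2)))


universe = {1, 2, 3}
part1 = [frozenset({1, 2}), frozenset({3})]  # U/E_1 (hypothetical)
part2 = [frozenset({1}), frozenset({2, 3})]  # U/E_2 (hypothetical)

elementary = class_as_product((1, 3), part1, part2)
print(sorted(elementary))  # [(1, 2), (1, 3), (2, 2), (2, 3)]
print(elementary == class_by_definition((1, 3), universe, part1, part2))  # True
```

The elementary relation obtained here contains $(1, 2)$ but not $(2, 1)$, and $(2, 2)$ but not $(1, 1)$, so it is neither symmetric nor reflexive.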

Given a binary relation $R \subseteq U \times U$, by definition (def1) we can approximate
it by two relations:

(def1) $\underline{apr}(R) = \bigcup \{[(x, y)]_E \mid (x, y) \in U \times U, \; [(x, y)]_E \subseteq R\}$,
       $\overline{apr}(R) = \bigcup \{[(x, y)]_E \mid (x, y) \in U \times U, \; [(x, y)]_E \cap R \neq \emptyset\}$.

Equivalently, definition (def2) can be used with respect to the subsystem $\mathrm{Def}(U \times U)$.
The rough relation approximations are dual to each other and satisfy properties
(L1)-(L4) and (U1)-(U4). Since a binary relation is a set with added information,
one can observe the following additional facts [7, 11, 13]:

1. Suppose $E = E_U \times E_U$. If $E_U \neq I_U$, neither $\underline{apr}(I_U)$ nor $\overline{apr}(I_U)$
is the identity relation, where $I_U = \{(x, x) \mid x \in U\}$ denotes the
identity relation on $U$.

2. For a reflexive relation $R$, $\overline{apr}(R)$ is reflexive, and $\underline{apr}(R)$ is not
necessarily reflexive.

3. For a symmetric relation $R$, both $\underline{apr}(R)$ and $\overline{apr}(R)$ are symmetric.

4. For a transitive relation $R$, $\underline{apr}(R)$ and $\overline{apr}(R)$ are not necessarily
transitive.

5. For an equivalence relation $R$, $\underline{apr}(R)$ and $\overline{apr}(R)$ are not necessarily
equivalence relations.

6. Suppose $E = E_U \times E_U$. For an equivalence relation $R$, $\overline{apr}(R)$ is an
equivalence relation if and only if $\overline{apr}(R) = (R \cup E_U)^*$, and $\underline{apr}(R)$
is an equivalence relation if and only if $E_U \subseteq R$, where $R^*$ denotes
the reflexive and transitive closure of a relation $R$.
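
Facts 1 and 2 can be observed on a small case. The sketch below assumes $E = E_U \times E_U$ with a hypothetical partition $U/E_U = \{\{1, 2\}, \{3\}\}$ and approximates the identity relation; the helper names are illustrative.

```python
from itertools import product


def elementary_relations(partition):
    """Equivalence classes of E = E_U x E_U on U x U."""
    return [frozenset(product(a, b)) for a in partition for b in partition]


def lower_approx(R, blocks):
    """Union of all elementary relations contained in R."""
    return frozenset().union(*(b for b in blocks if b <= R))


def upper_approx(R, blocks):
    """Union of all elementary relations that intersect R."""
    return frozenset().union(*(b for b in blocks if b & R))


def is_reflexive(R, universe):
    return all((x, x) in R for x in universe)


universe = {1, 2, 3}
blocks = elementary_relations([frozenset({1, 2}), frozenset({3})])
identity = frozenset((x, x) for x in universe)

low = lower_approx(identity, blocks)
up = upper_approx(identity, blocks)

print(sorted(low))                  # [(3, 3)]
print(sorted(up))                   # [(1, 1), (1, 2), (2, 1), (2, 2), (3, 3)]
print(is_reflexive(low, universe))  # False
print(is_reflexive(up, universe))   # True
```

Neither approximation equals the identity relation, and the lower approximation of this reflexive relation is not reflexive, while the upper approximation coincides with $E_U$.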

One can therefore conclude that the lower and upper approximations of a relation
may not have all the properties of the relation to be approximated. If an arbitrary
relation is approximated by elements of $\mathrm{Def}(U \times U)$, one cannot expect its
approximations to have any particular property. However, in some situations, it may be desirable
that a relation is approximated by relations having certain specific properties.
We clearly cannot achieve this goal with the standard formulation of rough
relations.


3.2  A  new  formulation 

If subsystems of $U \times U$ are properly chosen, some of the difficulties identified in
the last section can be avoided by generalizing definition (def2). In what follows,
a new formulation is presented, with focus on properties, such as reflexivity,
symmetry, and transitivity, of binary relations.

Let $P$ = {reflexive, symmetric, transitive} = $\{r, s, t\}$ denote a set of properties
of binary relations on $U$. For $A \subseteq P$, the set of binary relations satisfying all
properties in $A$ is denoted by $S_A$. For instance, $S_{\{r,s\}}$ consists of all reflexive and
symmetric relations (i.e., tolerance or compatibility relations). One can verify
the following properties:

1. The system $S_{\{r\}}$ is closed under both intersection and union. It does
not contain the empty relation, i.e., $\emptyset \notin S_{\{r\}}$, and contains the whole
relation, i.e., $U \times U \in S_{\{r\}}$.

2. The system $S_{\{s\}}$ is closed under both intersection and union. It contains
both the empty relation and the whole relation.

3. The system $S_{\{t\}}$ is closed under intersection, but not closed under
union. It contains both the empty relation and the whole relation.

4. The system of compatibility relations $S_{\{r,s\}}$ is closed under both
intersection and union. It contains the whole relation, and does not
contain the empty relation.

5. The system $S_{\{r,t\}}$ is closed under intersection and not closed under
union. It contains the whole relation, and does not contain the empty
relation.

6. The system $S_{\{s,t\}}$ is closed under intersection. It contains both the
empty relation and the whole relation.

7. The system of equivalence relations $S_{\{r,s,t\}}$ is closed under intersection,
but not closed under union. It contains the whole relation, and
does not contain the empty relation.
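
The failure of closure under union for transitive relations, and hence for equivalence relations, can be seen on a three-element universe. The relations below are hypothetical examples chosen for illustration.

```python
def is_transitive(R):
    """True if (a, b) and (b, c) in R always imply (a, c) in R."""
    return all((a, d) in R for (a, b) in R for (c, d) in R if b == c)


def is_reflexive(R, universe):
    return all((x, x) in R for x in universe)


universe = {1, 2, 3}
R1 = {(1, 2)}
R2 = {(2, 3)}

print(is_transitive(R1), is_transitive(R2))  # True True
print(is_transitive(R1 | R2))                # False, since (1, 3) is missing
print(is_transitive(R1 & R2))                # True

# Two equivalence relations whose union is reflexive but not transitive.
identity = {(x, x) for x in universe}
Q1 = identity | {(1, 2), (2, 1)}
Q2 = identity | {(2, 3), (3, 2)}
print(is_transitive(Q1 | Q2))                # False
print(is_reflexive(Q1 | Q2, universe))       # True
```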

They represent all possible subsystems with properties in the set $P$. It is interesting
to note that the subsystem $\mathrm{Def}(U \times U)$ induced by an equivalence relation
on $U \times U$ does not belong to any of the above classes. Subsystems that can be
used for various approximations are summarized as follows:

Lower approximation:

$S_{\{r\}} \cup \{\emptyset\}$, $S_{\{s\}}$, $S_{\{r,s\}} \cup \{\emptyset\}$.

Upper approximation:

All subsystems.

Lower and upper approximations:

$S_{\{r\}} \cup \{\emptyset\}$, $S_{\{s\}}$, $S_{\{r,s\}} \cup \{\emptyset\}$.

Although every subsystem can be used for defining upper approximation, only
three subsystems can be used for lower approximation.

Given a subsystem $S_l \subseteq 2^{U \times U}$ containing $\emptyset$ and being closed under union, and
a subsystem $S_u \subseteq 2^{U \times U}$ containing $U \times U$ and being closed under intersection,
the rough relation approximation of a binary relation $R$ is defined by:

(def2) $\underline{apr}(R) = \bigcup \{Q \mid Q \in S_l, \; Q \subseteq R\}$,
       $\overline{apr}(R) = \bigcap \{Q \mid Q \in S_u, \; R \subseteq Q\}$.

In the special case, two subsystems can be the same. For example, one may
use the subsystem of compatibility relations. By definition, rough relation
approximations satisfy properties (L2), (L3), (U2), (U3), and the following weaker
version of (L1) and (U1):

(L0) $R \subseteq Q \Longrightarrow \underline{apr}(R) \subseteq \underline{apr}(Q)$,

(U0) $R \subseteq Q \Longrightarrow \overline{apr}(R) \subseteq \overline{apr}(Q)$.

A detailed discussion of such subsystems in the setting of rough set approximations
can be found in a recent paper by Yao [15].

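Regarding the subsystems characterized by properties in $P = \{r, s, t\}$, we
have the following results:

(i). Suppose the pair of subsystems $(S_{\{r\}} \cup \{\emptyset\}, S_{\{r\}})$ is used for defining
lower and upper approximations. We have:

$\underline{apr}(R) = \begin{cases} \emptyset & \text{if } I_U \not\subseteq R, \\ R & \text{if } I_U \subseteq R, \end{cases}$

$\overline{apr}(R) = R \cup I_U$.

(ii). For the subsystem $S_{\{s\}}$, we have:

$\underline{apr}(R) = R \cap R^{-1}$,
$\overline{apr}(R) = R \cup R^{-1}$,

where $R^{-1} = \{(y, x) \mid xRy\}$ is the inverse of the relation $R$.

(iii). For the subsystem $S_{\{t\}}$, we have:

$\overline{apr}(R) = R^{+}$,

where $R^{+}$ denotes the transitive closure of the binary relation $R$.

(iv). For the subsystem $S_{\{r,s\}} \cup \{\emptyset\}$, we have:

$\underline{apr}(R) = \begin{cases} \emptyset & \text{if } I_U \not\subseteq R, \\ R \cap R^{-1} & \text{if } I_U \subseteq R, \end{cases}$

$\overline{apr}(R) = R \cup I_U \cup R^{-1}$.

(v). For the subsystem $S_{\{r,t\}}$, we have:

$\overline{apr}(R) = I_U \cup R^{+}$.

(vi). For the subsystem $S_{\{s,t\}}$, we have:

$\overline{apr}(R) = (R \cup R^{-1})^{+}$.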

(vii). For the subsystem $S_{\{r,s,t\}}$, we have:

$\overline{apr}(R) = (R \cup I_U \cup R^{-1})^{+}$.

One can see that the lower approximation is obtained by removing certain pairs
from the relation, while the upper approximation is obtained by adding certain
pairs to the relation, so that the required properties hold. This interpretation
of approximation is intuitively appealing. The definition (def2) only provides a
formal description of rough relation approximations. In practice, one can easily
obtain the approximations without actually constructing the subsystems and
using definition (def2).
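
As a hedged illustration of this last remark, the closed forms in (ii), (iv), and (vii) can be computed directly with elementary set operations. The function names and the example relation are assumptions made for this sketch, and the transitive closure uses a simple fixed-point iteration rather than an optimized algorithm.

```python
def inverse(R):
    """R^{-1}: every pair of R reversed."""
    return frozenset((y, x) for (x, y) in R)


def identity(universe):
    """I_U: the identity relation on the universe."""
    return frozenset((x, x) for x in universe)


def transitive_closure(R):
    """R^+: the smallest transitive relation containing R."""
    closure = set(R)
    while True:
        new_pairs = {(a, d) for (a, b) in closure for (c, d) in closure if b == c}
        if new_pairs <= closure:
            return frozenset(closure)
        closure |= new_pairs


def lower_symmetric(R):
    """Result (ii), lower approximation: R intersected with its inverse."""
    return frozenset(R) & inverse(R)


def upper_symmetric(R):
    """Result (ii), upper approximation: R united with its inverse."""
    return frozenset(R) | inverse(R)


def lower_compatibility(R, universe):
    """Result (iv), lower approximation: empty unless R is reflexive."""
    R = frozenset(R)
    return R & inverse(R) if identity(universe) <= R else frozenset()


def upper_equivalence(R, universe):
    """Result (vii), upper approximation: the least equivalence relation containing R."""
    R = frozenset(R)
    return transitive_closure(R | identity(universe) | inverse(R))


universe = {1, 2, 3}
R = {(1, 1), (2, 2), (3, 3), (1, 2), (2, 1), (2, 3)}

print(sorted(lower_compatibility(R, universe)))
# [(1, 1), (1, 2), (2, 1), (2, 2), (3, 3)]
print(len(upper_equivalence(R, universe)))  # 9
```

In this example the pair $(2, 3)$ is removed from below because it violates symmetry, while from above the pairs needed for symmetry and transitivity are added until the whole relation $U \times U$ is reached.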

When the subsystem $S_{\{r\}} \cup \{\emptyset\}$ is used for lower and upper approximations,
reflexive relations are fixed points. That is, both lower and upper approximations
of a reflexive relation equal the relation itself. Similar observations hold for
other subsystems.

Our formulation of rough relation approximations is very flexible. In
approximating a relation, two different subsystems may be used, one for lower
approximation, and the other for upper approximation. For example, one may
approximate an arbitrary binary relation from below by a compatibility relation,
and from above by an equivalence relation. If the relation is reflexive, then the
lower approximation is obtained by deleting pairs that violate the property of
symmetry, while the upper approximation is obtained by adding pairs so that
transitivity holds. Such a pair of lower and upper approximations provides
a good characterization of the original relation. The subsystems discussed so
far are some examples. In general, one can construct various subsystems for
approximation as long as they obey certain properties. The subsystem for lower
approximation must contain $\emptyset$ and be closed under union, and the subsystem
for upper approximation must contain $U \times U$ and be closed under intersection.
For example, for defining both lower and upper approximations one may select
a subset of $S_{\{r,s,t\}} \cup \{\emptyset\}$ such that it is closed under both intersection and union.


4  Conclusion 

A binary relation is not simply a set of pairs, but a set with additional information
and properties. The problem of rough relation approximation may therefore
be different from rough set approximations. In contrast to other related studies,
the main purpose of this paper is to investigate possibilities of using such extra
information in approximating relations. An alternative formulation of rough
relations is proposed based on subsystems of binary relations with certain properties,
instead of using equivalence relations. From a quite different point of view,
our formulation explicitly addresses some fundamental issues which have been
overlooked in existing studies of rough relations.

The results as shown by (i)-(vii) are simple and they could have been obtained
easily without the introduction of the new framework. However, the importance
of the approach should not be taken lightly. The recognition and utilization of
special classes of binary relations for approximating other binary relations may
have significant implications for the understanding and applications of rough
relation approximations. The results may be applied to rough function approximations
[8]. In this paper, we only considered three properties of binary relations.
With our formulation, other properties of binary relations can also be considered.
Order relations (i.e., preference relations) play a very important role in decision
theory [4, 5]. It may be useful to apply the proposed method for approximating
order relations.

Abstract

In this paper, we present a novel approach for approximating concepts in the
framework of formal concept analysis. Two main problems are investigated. In the
first, given a set $A$ of objects (or a set $B$ of features), we want to find a formal
concept that approximates $A$ (or $B$). In the second, given a pair $(A, B)$, where
$A$ is a set of objects and $B$ is a set of features, the objective is to find formal
concepts that approximate $(A, B)$. The techniques developed in this paper use
ideas from rough set theory. The approach we present is different from and more
general than existing approaches.

1  Introduction 

Formal concept analysis (FCA) is a mathematical framework developed by Rudolf
Wille and his colleagues at Darmstadt, Germany, that is useful for representation
and analysis of data [8]. A pair consisting of a set of objects and a set of
features common to these objects is called a concept. Using the framework of
FCA, concepts are structured in the form of a lattice called the concept lattice.
The concept lattice is a useful tool for knowledge representation and knowledge
discovery [2]. Formal concept analysis has also been applied in the area of conceptual
modeling that deals with the acquisition, representation and organization
of knowledge [4]. Several concept learning methods have been implemented in
[1, 2, 3] using ideas from formal concept analysis.

Not every pair of a set of objects and a set of features defines a concept [8].
Furthermore, we might be faced with a situation where we have a set of features
(or a set of objects) and need to find the best concept that approximates these
features (or objects). For example, when a physician diagnoses a patient, he finds
a disease whose symptoms are the closest to the symptoms that the patient has.
In this case we can think of the symptoms as features and diseases as objects. It
is therefore of fundamental importance to be able to find concept approximations
regardless of how little information is available.

*This research was supported in part by the Army Research Office, Grant No. DAAH04-96-1-0325,
under DEPSCoR program of Advanced Research Projects Agency, Department of
Defense and by the U.S. Department of Energy, Grant No. DE-FG02-97ER1220.

In  this  paper  we  present  a  general  approach  for  approximating  concepts.  We 
first  show  how  a  set  of  objects  (or  features)  can  be  approximated  by  a  concept. 
We  prove  that  our  approximations  are  the  best  that  can  be  achieved  using  rough 
sets.  We  then  extend  our  approach  to  approximate  a  pair  of  a  set  of  objects  and 
a  set  of  features. 


2  Background 

Relationships between objects and features in FCA are given in a context, which is
defined as a triple $(G, M, I)$, where $G$ and $M$ are sets of objects and features (also
called attributes), respectively, and $I \subseteq G \times M$. If object $g$ possesses feature $m$,
then $(g, m) \in I$, which is also written as $gIm$. The set of all features common to
a set of objects $A$ is denoted by $\beta(A)$ and defined as $\{m \in M \mid gIm \; \forall g \in A\}$.
Similarly, the maximal set of objects possessing all the features in a set of features
$B$ is denoted by $\alpha(B)$ and given by $\{g \in G \mid gIm \; \forall m \in B\}$. A formal concept
is defined as a pair $(A, B)$ where $A \subseteq G$, $B \subseteq M$, $\beta(A) = B$ and $\alpha(B) = A$. $A$
is called the extent of the concept and $B$ is called its intent.
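
The two operators can be sketched in a few lines. The context below is hypothetical, and the function names `alpha`, `beta`, and `is_concept` are chosen here only to mirror the notation of the text.

```python
def beta(A, M, I):
    """Features shared by every object in A."""
    return frozenset(m for m in M if all((g, m) in I for g in A))


def alpha(B, G, I):
    """Objects possessing every feature in B."""
    return frozenset(g for g in G if all((g, m) in I for m in B))


def is_concept(A, B, G, M, I):
    """(A, B) is a formal concept when beta(A) = B and alpha(B) = A."""
    return beta(A, M, I) == frozenset(B) and alpha(B, G, I) == frozenset(A)


# Hypothetical context: g1 has {a, b}, g2 has {a}, g3 has {b, c}.
G = {"g1", "g2", "g3"}
M = {"a", "b", "c"}
I = {("g1", "a"), ("g1", "b"), ("g2", "a"), ("g3", "b"), ("g3", "c")}

print(sorted(beta({"g1", "g2"}, M, I)))          # ['a']
print(sorted(alpha({"a"}, G, I)))                # ['g1', 'g2']
print(is_concept({"g1", "g2"}, {"a"}, G, M, I))  # True
print(is_concept({"g2"}, {"a"}, G, M, I))        # False
```

The pair $(\{g_2\}, \{a\})$ fails to be a concept because $g_1$ also possesses the feature $a$, so the extent determined by $\{a\}$ is larger than $\{g_2\}$.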

Using the above definitions of $\alpha$ and $\beta$, it is easy to verify that $A_1 \subseteq A_2$
implies that $\beta(A_1) \supseteq \beta(A_2)$, and $B_1 \subseteq B_2$ implies that $\alpha(B_1) \supseteq \alpha(B_2)$ for
every $A_1, A_2 \subseteq G$, and $B_1, B_2 \subseteq M$ [8]. Let $C(G, M, I)$ denote the set of all
concepts of the context $(G, M, I)$ and $(A_1, B_1)$ and $(A_2, B_2)$ be two concepts
in $C(G, M, I)$. $(A_1, B_1)$ is called a subconcept of $(A_2, B_2)$, which is denoted by
$(A_1, B_1) \leq (A_2, B_2)$, whenever $A_1$ is a subset of $A_2$ (or equivalently $B_1$ contains
$B_2$). The relation $\leq$ is an order relation on $C(G, M, I)$.

In the sequel we give an overview of a few basic rough set theory terms. Let
$U$ be a nonempty finite set of objects called the Universe. Let $A$ be a set of
attributes. Associate with each $a \in A$ a set $V_a$ of possible values of $a$ called
its domain. Let $a(x)$ denote the value of the attribute $a$ for element $x$. Let $B$ be
a subset of $A$ ($B$ can be equal to $A$). A binary relation $R_B$ on $U$ is defined as
$x R_B y \iff a(x) = a(y) \; \forall a \in B$. Clearly, $R_B$ is an equivalence relation and thus
forms a partition on $U$. Let $[x]_B$ denote the equivalence class of $x$ with respect
to $R_B$. When $B$ is clear from context, we will write $[x]$ instead of $[x]_B$. Let
$U/R_B$ denote the set of all equivalence classes determined by $R_B$. Equivalence
classes of the relation $R_B$ are called $B$-elementary sets (or just elementary sets).
Any finite union of elementary sets is called a definable set.

Given a set $X \subseteq U$, $X$ may not be definable. The relation $R_B$ can be
used to characterize $X$ by a pair of definable sets called its lower and upper
approximations. The lower and upper approximations of $X$ with respect to
$R_B$ (or set of attributes $B$) are defined as $\underline{B}(X) = \{m \in U \mid [m]_B \subseteq X\}$ and
$\overline{B}(X) = \{m \in U \mid [m]_B \cap X \neq \emptyset\}$, respectively. Clearly, the lower approximation
of $X$ is the greatest definable set contained in $X$ and the upper approximation
of $X$ is the smallest definable set containing $X$. The difference between the
upper and lower approximations of $X$ is known as the boundary region of $X$
and is denoted by $BND(X)$. If $BND(X)$ is an empty set, then $X$ is a definable
set with respect to $B$; on the other hand, if $BND(X)$ is not empty, then $X$ is
referred to as a rough set with respect to $B$ [6].
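
A minimal sketch of these notions, assuming the attribute values are stored in a small dictionary-based information table; the table, the attribute names, and the function names are all hypothetical.

```python
from collections import defaultdict


def indiscernibility_classes(table, B):
    """Partition of the universe by equal values on every attribute in B."""
    classes = defaultdict(set)
    for x, row in table.items():
        classes[tuple(row[a] for a in B)].add(x)
    return [frozenset(c) for c in classes.values()]


def approximations(X, classes):
    """Return the lower approximation, upper approximation, and boundary of X."""
    lower = frozenset().union(*(c for c in classes if c <= X))
    upper = frozenset().union(*(c for c in classes if c & X))
    return lower, upper, upper - lower


# Hypothetical information table with attributes "color" and "size".
table = {
    1: {"color": "red", "size": "small"},
    2: {"color": "red", "size": "small"},
    3: {"color": "blue", "size": "small"},
    4: {"color": "blue", "size": "large"},
}
classes = indiscernibility_classes(table, ["color", "size"])
lower, upper, boundary = approximations(frozenset({1, 3}), classes)

print(sorted(lower))     # [3]
print(sorted(upper))     # [1, 2, 3]
print(sorted(boundary))  # [1, 2]
```

Objects 1 and 2 are indiscernible on both attributes, so the set $\{1, 3\}$ has a nonempty boundary region and is rough with respect to these attributes.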


3  Existing  Approach 

The existing approach for approximating concepts is due to Kent and is called
rough concept analysis [5]. It relies on the existence of an equivalence relation,
$E$, on the set of objects, $G$, that is provided by an expert. A pair $(G, E)$ where $E$
is an equivalence relation on $G$ is called an approximation space. An $E$-definable
formal context of $G$-objects and $M$-attributes is a formal context $(G, M, I)$ whose
elementary extents $\{Im \mid m \in M\}$ are $E$-definable subsets of $G$-objects, where
$Im = \{g \in G \mid gIm\}$.

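The lower and upper $E$-approximations of $I$ with respect to $(G, E)$ are denoted
by $I_E$ and $I^E$, respectively, and given by

$I_E = \{(g, m) \mid [g]_E \subseteq Im\}$, $I^E = \{(g, m) \mid [g]_E \cap Im \neq \emptyset\}$.

The formal context $(G, M, I)$ can be approximated by the lower and upper
contexts $(G, M, I_E)$ and $(G, M, I^E)$. The rough extents of an attribute set $B \subseteq M$
with respect to $I_E$ and $I^E$ are defined by

$\alpha(B_E) = \alpha(B)_E = \bigcap_{m \in B} (Im)_E$ and $\alpha(B^E) = \alpha(B)^E = \bigcap_{m \in B} (Im)^E$,

where $(Im)_E$ and $(Im)^E$ are the lower and upper approximations of $Im$ with respect to $E$.

Any formal concept $(A, B) \in C(G, M, I)$ can be approximated by means of
$I_E$ and $I^E$. The lower and upper $E$-approximations of $(A, B)$ are given by

$(A, B)_E = (\alpha(B_E), \beta(\alpha(B_E)))$ and $(A, B)^E = (\alpha(B^E), \beta(\alpha(B^E)))$.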

4  Formal  Rough  Concept  Analysis 

In the previous section we presented an overview of the existing approach for
approximating concepts. This approach is not direct because upper and lower
approximations for the context $(G, M, I)$ have to be found first and then used for
approximating a pair $(A, B)$ of objects and features. The resulting upper and
lower approximations of $(G, M, I)$ depend on the approximation space $(G, E)$
as described in Section 3. This means that different equivalence relations on $G$
would result in different answers. Furthermore, the set $A$ was not used in the
approximation. This means that all pairs that have the same set of features will
always have the same lower and upper $E$-approximations.

In this section we present a different and more general approach for approximating
concepts. Our approach is consistent in that the relation or relations we
use in the approximation are defined in a way that assures that the same answer
is always given. We first show how a set $A$ of objects (or $B$ of features) can be
approximated by a concept whose extent (intent) approximates $A$ ($B$). Then we
show how a pair $(A, B)$ can be approximated by one or two concepts.

First, a few definitions need to be given. Let $(G, M, I)$ be a context. Not
every subset $A \subseteq G$ is an extent, nor is every subset $B \subseteq M$ an intent. Wille [8]
has shown that $A \subseteq \alpha(\beta(A))$ for any $A \subseteq G$ and $B \subseteq \beta(\alpha(B))$ for any $B \subseteq M$.
Furthermore, $\beta(\bigcup_{i \in J} A_i) = \bigcap_{i \in J} \beta(A_i)$ and $\alpha(\bigcup_{i \in J} B_i) = \bigcap_{i \in J} \alpha(B_i)$, where $J$
is an index set. This latter result is used below. A set of objects $A$ is called
feasible if $A = \alpha(\beta(A))$. Similarly, a set of features $B$ is feasible if $B = \beta(\alpha(B))$.
If $A$ is feasible, then clearly $(A, \beta(A))$ is a concept. Similarly, if $B$ is feasible,
then $(\alpha(B), B)$ is a concept. Let us also say that a set $A \subseteq G$ is definable
if it is the union of feasible extents; otherwise, we say that $A$ is nondefinable.
Similarly, $B \subseteq M$ is definable if it is the union of feasible intents; otherwise, $B$
is nondefinable. A pair $(A, B)$ is called a definable concept if both $A$ and $B$ are
definable, $\alpha(B) = A$ and $\beta(A) = B$; otherwise, $(A, B)$ is a nondefinable concept.

4.1  Approximating  a  Set  of  Objects 

Given a set of objects $A \subseteq G$, we are interested in finding a definable concept
that approximates $A$. We have the following cases:

Case 1: $A$ is feasible. Clearly $(A, \beta(A))$ is a definable concept. Therefore,
$(A, \beta(A))$ is the best approximation.

Case 2: $A$ is definable. Since $A$ is definable, it can be written as $A = A_1 \cup
A_2 \cup \ldots \cup A_n$, where each $A_i$, $i = 1, \ldots, n$, is feasible.

Hence, $\beta(A) = \beta(A_1 \cup A_2 \cup \ldots \cup A_n) = \beta(A_1) \cap \beta(A_2) \cap \ldots \cap \beta(A_n) = \bigcap_{i=1}^{n} \beta(A_i)$.
Therefore, when $A$ is definable, the best approximation is obtained by

$(A, \beta(A)) = \left(\bigcup_{i=1}^{n} A_i, \; \bigcap_{i=1}^{n} \beta(A_i)\right)$.
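
The identity used in Case 2 can be checked numerically. The sketch below assumes a hypothetical three-object context in which $\{g_1\}$ and $\{g_3\}$ are both feasible; the names are illustrative only.

```python
def beta(A, M, I):
    """Features shared by every object in A."""
    return frozenset(m for m in M if all((g, m) in I for g in A))


# Hypothetical context: g1 has {a, b}, g2 has {a}, g3 has {b, c}.
M = {"a", "b", "c"}
I = {("g1", "a"), ("g1", "b"), ("g2", "a"), ("g3", "b"), ("g3", "c")}

A1, A2 = {"g1"}, {"g3"}
intent_of_union = beta(A1 | A2, M, I)
meet_of_intents = beta(A1, M, I) & beta(A2, M, I)

print(sorted(intent_of_union))             # ['b']
print(intent_of_union == meet_of_intents)  # True
```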

Case 3: $A$ is nondefinable. If $A$ is nondefinable, it is not as straightforward to
find a definable concept that approximates $A$. Our approach is to think of $A$ as
a rough set. We first find a pair of definable sets $\underline{A}$ and $\overline{A}$ that best approximate
$A$. $\underline{A}$ and $\overline{A}$ are then used in finding two concepts that best approximate $A$.

Let $gI = \{m \in M \mid gIm\}$ denote the set of all features that are possessed by
the object $g$. Define a relation $R$ on $G$ as follows:

$g_1 R g_2$ iff $g_1 I = g_2 I$, where $g_1, g_2 \in G$.

Clearly, $R$ is reflexive, symmetric and transitive. Thus, $R$ is an equivalence
relation on $G$. Let $G/R$ be the set of all equivalence classes induced by $R$ on $G$.

Lemma 4.1 Each equivalence class $X \in G/R$ is a feasible extent.

Proof. Assume not, that is, there is an $X \in G/R$ which is not a feasible extent.
Therefore, $X \subseteq \alpha(\beta(X))$ and $X \neq \alpha(\beta(X))$. So there is an object $g \in \alpha(\beta(X)) - X$
such that $gIm$ $\forall m \in \beta(X)$. But this is a contradiction because by the
definition of $R$, $g$ must be in $X$ because it has all the features in $\beta(X)$. Therefore,
$X = \alpha(\beta(X))$, which means that $X$ is a feasible extent. □

It follows from the previous lemma that each equivalence class $X \in G/R$ is
a feasible extent and thus is an elementary extent. Therefore, define the lower
and upper approximations of $A \subseteq G$ with respect to $R$ as

$\underline{A} = \{g \in G \mid [g] \subseteq A\}$, and $\overline{A} = \{g \in G \mid [g] \cap A \neq \emptyset\}$.

Now, we can find two concepts that approximate $A$. The lower approximation
is given by $(\underline{A}, \beta(\underline{A}))$ and the upper approximation is given by $(\overline{A}, \beta(\overline{A}))$.
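
The construction can be sketched as follows, assuming a small hypothetical context. The code only computes the two pairs as they are defined above; the function names and the data are illustrative.

```python
from collections import defaultdict


def object_classes(G, M, I):
    """Classes of R: objects grouped by identical feature sets gI."""
    groups = defaultdict(set)
    for g in G:
        groups[frozenset(m for m in M if (g, m) in I)].add(g)
    return [frozenset(c) for c in groups.values()]


def beta(A, M, I):
    """Features shared by every object in A."""
    return frozenset(m for m in M if all((g, m) in I for g in A))


def approximate_objects(A, G, M, I):
    """Return the pairs (lower, beta(lower)) and (upper, beta(upper))."""
    classes = object_classes(G, M, I)
    lower = frozenset().union(*(c for c in classes if c <= A))
    upper = frozenset().union(*(c for c in classes if c & A))
    return (lower, beta(lower, M, I)), (upper, beta(upper, M, I))


# Hypothetical context: g1 and g2 have {a, b}, g3 has {a, c}, g4 has {c}.
G = {"g1", "g2", "g3", "g4"}
M = {"a", "b", "c"}
I = {("g1", "a"), ("g1", "b"), ("g2", "a"), ("g2", "b"),
     ("g3", "a"), ("g3", "c"), ("g4", "c")}

A = frozenset({"g1", "g3"})
(low, low_intent), (up, up_intent) = approximate_objects(A, G, M, I)
print(sorted(low), sorted(low_intent))  # ['g3'] ['a', 'c']
print(sorted(up), sorted(up_intent))    # ['g1', 'g2', 'g3'] ['a']
```

Because $g_1$ and $g_2$ have identical feature sets, any approximation of $A$ built from the classes of $R$ must either include both or exclude both.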

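Lemma 4.2 If A is a nondefinable extent, then the best lower and upper approximations are given by $(\underline{A}, \beta(\underline{A}))$ and $(\overline{A}, \beta(\overline{A}))$, respectively.

Proof. $\underline{A}$ is a union of feasible extents and thus is a definable extent. Therefore, $(\underline{A}, \beta(\underline{A}))$ is a definable concept. Similarly, we can show that $(\overline{A}, \beta(\overline{A}))$ is a definable concept.

Since $\underline{A} \subseteq A \subseteq \overline{A}$, we have $(\underline{A}, \beta(\underline{A})) \leq (A, \beta(A)) \leq (\overline{A}, \beta(\overline{A}))$. Furthermore, $\underline{A}$ is the greatest definable extent contained in A and $\overline{A}$ is the least definable extent containing A. This implies that $(\underline{A}, \beta(\underline{A}))$ is the greatest definable subconcept of $(A, \beta(A))$ and $(\overline{A}, \beta(\overline{A}))$ is the least definable superconcept of $(A, \beta(A))$. Therefore, $(\underline{A}, \beta(\underline{A}))$ and $(\overline{A}, \beta(\overline{A}))$ are the best lower and upper approximations. □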
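The construction of Case 3 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the context is assumed to be a dictionary `ctx` mapping each object to its set of features, and the names `beta`, `approximate_extent` and the toy context are hypothetical.

```python
def beta(objs, ctx):
    """Features shared by every object in objs (beta in the text)."""
    feats = set().union(*ctx.values())
    return {m for m in feats if all(m in ctx[g] for g in objs)}


def approximate_extent(A, ctx):
    """Return the concepts built on the lower and upper approximation of A."""
    # Objects with identical feature sets fall into the same class of R.
    classes = {}
    for g, fs in ctx.items():
        classes.setdefault(frozenset(fs), set()).add(g)
    lower = set().union(*(c for c in classes.values() if c <= A))
    upper = set().union(*(c for c in classes.values() if c & A))
    return (lower, beta(lower, ctx)), (upper, beta(upper, ctx))


# Hypothetical toy context: g1 and g2 are indistinguishable.
ctx = {"g1": {"a", "b"}, "g2": {"a", "b"}, "g3": {"b", "c"}, "g4": {"c"}}
lower, upper = approximate_extent({"g1", "g3"}, ctx)
```

In this toy context the set {g1, g3} splits the class {g1, g2}, so the lower approximation keeps only g3 while the upper approximation absorbs g2 as well.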


4.2  Approximating  a  Set  of  Features 

Because approximating a set of features is similar to approximating a set of objects and because of limitations of space, we will omit some unnecessary details.

Case 1: B is feasible. The concept $(\alpha(B), B)$ best approximates B.

Case 2: B is definable. B can be written as $B = \bigcup_{i=1}^{m} B_i$ where each $B_i$ is feasible. Hence, $\alpha(B) = \alpha(\bigcup_{i=1}^{m} B_i) = \bigcap_{i=1}^{m} \alpha(B_i)$. Therefore, B can be approximated by the definable concept $(\alpha(B), B) = (\bigcap_{i=1}^{m} \alpha(B_i), \bigcup_{i=1}^{m} B_i)$.

Case 3: B is nondefinable. Let $Im = \{g \in G \mid gIm\}$ be the set of all objects that possess the attribute m. Define a relation R' on M as follows:

$$m_1 R' m_2 \text{ iff } Im_1 = Im_2, \text{ where } m_1, m_2 \in M.$$

Clearly, R' is an equivalence relation. Let M/R' be the set of all equivalence classes induced by R' on M.

Lemma 4.3 Each equivalence class $Y \in M/R'$ is a feasible intent and thus an elementary set.

Using the result of the previous lemma, the lower and upper approximations of $B \subseteq M$ with respect to R' are defined by

$$\underline{B} = \{m \in M \mid [m] \subseteq B\}, \text{ and } \overline{B} = \{m \in M \mid [m] \cap B \neq \emptyset\}.$$

The next lemma, which can be proved similarly to Lemma 4.2, gives two concepts that best approximate B.

Lemma 4.4 If B is a nondefinable intent, then the best lower and upper approximations are given by $(\alpha(\underline{B}), \underline{B})$ and $(\alpha(\overline{B}), \overline{B})$, respectively.

4.3 Approximating a Concept

Given a pair (A, B) where $A \subseteq G$ and $B \subseteq M$, we want to find one or two concepts approximating (A, B). Four different cases need to be considered: I) both A and B are definable, II) A is definable and B is not, III) B is definable and A is not, and IV) both A and B are nondefinable.

4.3.1  Both  A  and  B  are  Definable 

Four  subcases  need  to  be  considered. 

1. Both A and B are feasible. If $\beta(A) = B$, then $\alpha(B)$ must equal A because both A and B are feasible. Thus the given concept (A, B) is definable and no approximation is needed.

If $\beta(A) \neq B$ (and thus $\alpha(B) \neq A$), let $\beta(A) = A'$ and $\alpha(B) = B'$. Since both A and B are feasible, both $(A, A')$ and $(B', B)$ are definable concepts in (G, M, I). Consider the two concepts $(A \cup B', \beta(A \cup B')) = (A \cup B', A' \cap B)$ and $(\alpha(A' \cup B), A' \cup B) = (A \cap B', A' \cup B)$. We notice that $\beta(B') = B$ and $\alpha(A') = A$ because B and A are feasible. Furthermore, $(A \cap B', A' \cup B) \leq (A, B) \leq (A \cup B', A' \cap B)$. Therefore, the lower and upper approximations of (A, B) are given by $\underline{(A, B)} = (A \cap B', A' \cup B)$ and $\overline{(A, B)} = (A \cup B', A' \cap B)$.

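2. A is feasible and B is not. Since B is definable, it can be written as a union of feasible intents. Let $B = \bigcup_{i=1}^{m} B_i$ where $B_i$ is feasible for i = 1, 2, ..., m. Let $\alpha(B_i) = B_i'$ for i = 1, 2, ..., m, and $\alpha(B) = B'$. Then

$$B' = \alpha(B) = \alpha\Big(\bigcup_{i=1}^{m} B_i\Big) = \bigcap_{i=1}^{m} \alpha(B_i) = \bigcap_{i=1}^{m} B_i'.$$

Therefore, the lower and upper approximations of (A, B) are given by

$$\underline{(A, B)} = (A \cap B', A' \cup B) = \Big(\bigcap_{i=1}^{m} (A \cap B_i'),\ \bigcup_{i=1}^{m} (A' \cup B_i)\Big)$$

and

$$\overline{(A, B)} = (A \cup B', A' \cap B) = \Big(\bigcap_{i=1}^{m} (A \cup B_i'),\ \bigcup_{i=1}^{m} (A' \cap B_i)\Big).$$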


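3. B is feasible and A is not. This case is similar to the previous case and details are omitted. The lower and upper approximations are given by

$$\underline{(A, B)} = (A \cap B', A' \cup B) = \Big(\bigcup_{i=1}^{l} (A_i \cap B'),\ \bigcap_{i=1}^{l} (A_i' \cup B)\Big), \text{ and}$$

$$\overline{(A, B)} = (A \cup B', A' \cap B) = \Big(\bigcup_{i=1}^{l} (A_i \cup B'),\ \bigcap_{i=1}^{l} (A_i' \cap B)\Big),$$

where $A = \bigcup_{i=1}^{l} A_i$ and each $A_i$ is feasible for i = 1, 2, ..., l.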

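4. Both A and B are not feasible. Since A and B are definable, they can be written as unions of feasible extents and intents, respectively.

Let $A = \bigcup_{i=1}^{l} A_i$ and $B = \bigcup_{j=1}^{k} B_j$, where $A_i$ and $B_j$ are feasible for i = 1, 2, ..., l and j = 1, 2, ..., k. Let $A'$, $A_i'$, $B'$, and $B_j'$ denote $\beta(A)$, $\beta(A_i)$, $\alpha(B)$, and $\alpha(B_j)$, respectively. Then,

$$A' = \beta(A) = \beta\Big(\bigcup_{i=1}^{l} A_i\Big) = \bigcap_{i=1}^{l} \beta(A_i) = \bigcap_{i=1}^{l} A_i', \text{ and}$$

$$B' = \alpha(B) = \alpha\Big(\bigcup_{j=1}^{k} B_j\Big) = \bigcap_{j=1}^{k} \alpha(B_j) = \bigcap_{j=1}^{k} B_j'.$$

The lower and upper approximations of (A, B) are given by

$$\underline{(A, B)} = (A \cap B', A' \cup B) = \Big(\big(\bigcup_{i=1}^{l} A_i\big) \cap \big(\bigcap_{j=1}^{k} B_j'\big),\ \big(\bigcap_{i=1}^{l} A_i'\big) \cup \big(\bigcup_{j=1}^{k} B_j\big)\Big), \text{ and}$$

$$\overline{(A, B)} = (A \cup B', A' \cap B) = \Big(\big(\bigcup_{i=1}^{l} A_i\big) \cup \big(\bigcap_{j=1}^{k} B_j'\big),\ \big(\bigcap_{i=1}^{l} A_i'\big) \cap \big(\bigcup_{j=1}^{k} B_j\big)\Big).$$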

4.3.2  A  is  Definable  and  B  is  not 

Since A is definable, it can be written as $A = \bigcup_{i=1}^{l} A_i$ where each $A_i$ is feasible.

Define a binary relation R' on M such that for $m_1, m_2 \in M$, $m_1 R' m_2$ iff $Im_1 = Im_2$. Clearly, R' is an equivalence relation and thus can be used in creating a rough approximation for any subset of M as was done earlier. Let $\underline{B}$ and $\overline{B}$ be the lower and upper approximations of B with respect to R'.

$\underline{B}$ and $\overline{B}$ can be used in creating lower and upper approximations for (A, B) in the context (G, M, I). The lower and upper approximations are given by

$$\underline{(A, B)} = (A \cap \alpha(\overline{B}), \beta(A) \cup \overline{B}), \text{ and } \overline{(A, B)} = (A \cup \alpha(\underline{B}), \beta(A) \cap \underline{B}).$$

The concept $(A \cap \alpha(\overline{B}), \beta(A) \cup \overline{B})$ is definable because $\beta(A \cap \alpha(\overline{B})) = \beta(A) \cup \overline{B}$. Similarly, $(A \cup \alpha(\underline{B}), \beta(A) \cap \underline{B})$ is definable because $\beta(A \cup \alpha(\underline{B})) = \beta(A) \cap \underline{B}$.

We can show that the approximations developed above are indeed correct by observing that $\underline{B} \subseteq B \subseteq \overline{B}$, which implies that $\beta(A) \cap \underline{B} \subseteq B \subseteq \beta(A) \cup \overline{B}$. Therefore,

$$(A \cap \alpha(\overline{B}), \beta(A) \cup \overline{B}) \leq (A, B) \leq (A \cup \alpha(\underline{B}), \beta(A) \cap \underline{B}).$$

To get the final answer we need to substitute $\bigcup_{i=1}^{l} A_i$ in place of A. However, we choose not to do that in this context to make the results easier to read.

4.3.3  B  is  Definable  and  A  is  not 

The scenario here is similar to that in the previous subsection and we will just sketch the results. Since B is definable, it can be written as $B = \bigcup_{j=1}^{k} B_j$ where $B_j$ is feasible for j = 1, 2, ..., k. Define a relation on G by $g_1 R g_2$ iff $g_1 I = g_2 I$. R is an equivalence relation on G and can be used in approximating the nondefinable set A. Let $\underline{A}$ and $\overline{A}$ represent the lower and upper approximations of A with respect to R. The lower and upper approximations of (A, B) are given by

$$\underline{(A, B)} = (\underline{A} \cap \alpha(B), \beta(\underline{A}) \cup B) \text{ and } \overline{(A, B)} = (\overline{A} \cup \alpha(B), \beta(\overline{A}) \cap B).$$

4.3.4  Both  A  and  B  are  Nondefinable 

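Neither A nor B can be used in approximating the nondefinable concept (A, B). However, we can combine the approaches from the previous two subsections and define R to be an equivalence relation on G and R' to be an equivalence relation on M, using the same definitions as in the previous two subsections. Let the lower and upper approximations of A with respect to R be given by $\underline{A}$ and $\overline{A}$, respectively. Similarly, let $\underline{B}$ and $\overline{B}$ denote the lower and upper approximations of B with respect to R'. Now, $\underline{A}$, $\overline{A}$, $\underline{B}$ and $\overline{B}$ are definable sets that can be used for approximating the nondefinable concept (A, B). The lower approximation is given by

$$\underline{(A, B)} = (\underline{A} \cap \alpha(\overline{B}), \beta(\underline{A}) \cup \overline{B}).$$

Similarly, the upper approximation is given by

$$\overline{(A, B)} = (\overline{A} \cup \alpha(\underline{B}), \beta(\overline{A}) \cap \underline{B}).$$

Clearly, both $(\underline{A} \cap \alpha(\overline{B}), \beta(\underline{A}) \cup \overline{B})$ and $(\overline{A} \cup \alpha(\underline{B}), \beta(\overline{A}) \cap \underline{B})$ are definable concepts. Furthermore, it is easy to prove that

$$(\underline{A} \cap \alpha(\overline{B}), \beta(\underline{A}) \cup \overline{B}) \leq (A, B) \leq (\overline{A} \cup \alpha(\underline{B}), \beta(\overline{A}) \cap \underline{B}).$$

Hence, our proposed approximations are correct.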
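The case in which both A and B are nondefinable can be sketched in Python as below. This is an illustrative sketch only: the placement of lower and upper approximations inside each pair follows the reading of the formulas given in this section, the context is assumed to be a dictionary from objects to feature sets, and all function names and the toy data are hypothetical.

```python
def alpha(feats, ctx):
    """Objects that possess every feature in feats."""
    return {g for g, fs in ctx.items() if set(feats) <= fs}


def beta(objs, ctx):
    """Features shared by every object in objs."""
    all_feats = set().union(*ctx.values())
    return {m for m in all_feats if all(m in ctx[g] for g in objs)}


def rough(target, universe, signature):
    """Lower and upper approximation; equal signatures mean equivalent elements."""
    lower, upper = set(), set()
    for x in universe:
        cls = {y for y in universe if signature(y) == signature(x)}
        if cls <= target:
            lower |= cls
        if cls & target:
            upper |= cls
    return lower, upper


def approximate_concept(A, B, ctx):
    G = set(ctx)
    M = set().union(*ctx.values())
    # R groups objects with equal feature sets; R' groups features with equal object sets.
    A_lo, A_up = rough(A, G, lambda g: frozenset(ctx[g]))
    B_lo, B_up = rough(B, M, lambda m: frozenset(alpha({m}, ctx)))
    lower = (A_lo & alpha(B_up, ctx), beta(A_lo, ctx) | B_up)
    upper = (A_up | alpha(B_lo, ctx), beta(A_up, ctx) & B_lo)
    return lower, upper


# Hypothetical toy context: g1, g2 are indistinguishable, and so are features a, d.
ctx = {"g1": {"a", "b", "d"}, "g2": {"a", "b", "d"}, "g3": {"b", "c"}, "g4": {"c"}}
lower, upper = approximate_concept({"g1", "g3"}, {"a", "c"}, ctx)
```

For this small context the two approximations collapse to the extreme pairs (empty extent with all features, all objects with empty intent), which shows how coarse the bounds can become when both components are nondefinable.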


5  Conclusion  and  Future  Work 

This paper presents a new approach for approximating concepts using rough sets. Using this approach, the given context is used directly for finding upper and lower approximations for nondefinable concepts without any context approximation, which is a major difference from the approach given in [5]. Another major difference between our approach and the existing approach is that in the existing approach an equivalence relation on the set of objects has to be given by an expert before the approximation process can start. Different equivalence relations would usually result in different approximations for a given nondefinable concept. Our natural choice of an equivalence relation R on the set of objects G is such that objects that share the same set of features are grouped together in the same equivalence class. This choice guarantees the uniqueness of our approximations. Furthermore, such a definition of R can help automate the process of approximating concepts.

Using  the  new  approach,  we  showed  how  a  set  of  objects,  features,  or  a 
nondefinable  concept  can  be  approximated  by  a  definable  concept.  We  proved 
that  the  approximations  found  for  a  set  of  features  or  objects  are  the  best  one 
can  get.  The  ideas  developed  in  this  paper  are  useful  for  information  retrieval, 
knowledge  acquisition  and  conceptual  modeling.  An  implementation  of  a  system 
that  uses  the  ideas  developed  in  this  paper  is  currently  in  progress. 

Abstract. A new concept for the reduction of non-stationary noise affecting audio signals transmitted in telecommunication channels is proposed. This concept exploits some features of the human auditory system as well as some methods originating from the soft computing domain, i.e. rough set-based reasoning and neural processing. The foundations of the engineered method and a description of the applied decision algorithms are presented. A number of experiments have been prepared, and some of them have already been carried out. A brief discussion of the results of these experiments and some conclusions are also included.


1  Introduction 

The commonly used noise reduction methods do not use some subjective properties of the human auditory system which have been successfully exploited in audio coding standards. However, as was revealed by the results of experiments carried out by the authors, auditory masking can also be used to suppress noise corrupting audio signals. The mathematical foundations of this perceptual approach and the relevant algorithms were presented in some recent papers by the authors [4] [6].

In all noise reduction methods, there is a need to know at least approximate statistics of the noise. This problem becomes more complex in the case of non-stationary noise, since such a method requires choosing one noise statistic from among others. Hence, the problem of an efficient decision system arises. Therefore intelligent algorithms (i.e. rough sets or neural networks) can be very helpful as decision systems in this field of application [2] [4].


2  Psychoacoustics  Principles 

The concept of critical bands is related to the propagation and processing of acoustic signals in the human auditory system. Well-proven phenomena reveal that the inner ear behaves as a bank of band-pass filters which analyse a broad spectral range in subbands independently of one another. These subbands are called critical bands, and a perceptual unit of frequency has been introduced. It is called the Bark and is related to the width of a single subband. An often used transformation to this subjective scale of hearing is the following relation proposed by Zwicker [11]:

$$b = 13 \cdot \arctan(0.76 \cdot 10^{-3} \cdot f) + 3.5 \cdot \arctan\left[(f/7500)^2\right] \qquad (1)$$

where b and f denote frequency in Barks and Hz, respectively.
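As a quick numerical illustration of the Hz-to-Bark mapping, the following Python sketch evaluates Zwicker's relation in its commonly quoted form. The function name `hz_to_bark` is hypothetical, and the 4096 Hz evaluation simply assumes the Nyquist frequency of the 8192 Hz sampling rate used later in the experiments.

```python
import math


def hz_to_bark(f_hz):
    """Critical-band rate in Bark for a frequency in Hz (Zwicker's approximation)."""
    return 13.0 * math.atan(0.76e-3 * f_hz) + 3.5 * math.atan((f_hz / 7500.0) ** 2)


# Number of critical bands needed to cover the range up to 4096 Hz.
bands_up_to_nyquist = math.ceil(hz_to_bark(4096.0))
```

Under these assumptions the band count comes out as 18, which agrees with the B = 18 critical bands reported in the experimental section.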

Another psychoacoustic phenomenon is related to masking. Some tones can be inaudible in the presence of others, especially when one of them is louder and their frequencies are not too distant. The tones which mask others are called maskers, and this phenomenon is fundamental for contemporary audio coding standards [8]. More details on psychoacoustics can be found in the abundant literature [11].


3 Description of the Perceptual Noise Reduction System

The perceptual noise reduction system (Fig. 1) is fed by two inputs: the noisy signal $y(m)$ and the noise patterns $\tilde{n}(m)$. The signal $y(m)$ consists of the original audio signal $x(m)$ corrupted by the noise $n(m)$, and is transformed to the spectral representation $Y(j\omega)$ with the use of the DFT procedure. In turn, the patterns $\tilde{n}(m)$ are assumed to be correlated with the noise $n(m)$, and are taken from empty passages of the signal transmitted in a telecommunication channel. The signal $\tilde{n}(m)$ is delivered to the Noise Estimation Module, whose task is to collect essential information on the noise $n(m)$. At its output, the time-frequency noise estimation $\rho(t, j\omega)$ is obtained. Both this estimation $\rho(t, j\omega)$ and the spectrum of the corrupted audio $Y(j\omega)$ are supplied to the Decision Systems. Its first task is to select the one of the collected spectral estimations $\rho(j\omega) \subset \rho(t, j\omega)$ that is best correlated with the corrupting noise at a given moment. The second task is to assign the elements of the signal $Y(j\omega)$ to two disjoint sets: the set U of useful elements or the set D of useless elements. It is necessary to know which spectral components are maskers (useful), and which ones are to be masked (useless).


Fig.  1.  General  lay-out  of  the  noise  reduction  system 

The spectrum of the corrupted signal $Y(j\omega)$ as well as the sets U, D and the chosen noise estimation $\rho(j\omega)$ are fed to the Perceptual Noise Reduction Module, which executes a perceptual algorithm of noise reduction. Next, the output $\hat{Y}(j\omega)$ is processed by the inverse DFT procedure, and finally the restored signal $\hat{y}(m)$ is obtained, which is subjectively perceived as less noisy than the original one.


4  Implementation  of  the  Noise  Reduction  System 

4.1  Noise  Estimation  Module 

The Noise Estimation Module's run can be divided into two modes: the noise analysis mode and the noise reduction mode (Fig. 2). In the first one (Fig. 2a), the patterns $\tilde{n}(m)$ are analysed in the Extraction of Noise Parameters block, where they are transformed into the spectral domain, averaged over L subsequent frames and analysed. As a result, two kinds of output are obtained: the average power spectrum $\hat{N}_k$, and the associated vector of coefficients $V_k^n$ related to the spectrum $\hat{N}_k$. The index k denotes the time interval within which elements of these vectors are computed. Subsequently, both vectors are collected in the Table of Vectors. The content of the table is used during the training of decision algorithms in the Decision Systems and in the noise reduction mode (Fig. 2b). In this latter mode, in response to a query to the table, the appropriate spectrum $\hat{N}_j$ is output, which is expected to be the one most correlated with the noise currently corrupting the audio. The query index value is produced by a decision system, and denotes the index of a desired spectrum in the table.



Fig.  2.  Scheme  of  the  Noise  Estimation  Module:  (a)  noise  analysis  mode,  (b)  noise  reduction 
mode.  Dashed  lines  denote  inactive  connections 

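In the case of use of the N-point DFT, the vector $\hat{N}_k$ is defined as below:

$$\hat{N}_k = [\hat{N}_{1,k} \ \ldots \ \hat{N}_{n,k} \ \ldots \ \hat{N}_{N/2,k}]^T \quad \text{and} \quad \hat{N}_{n,k} = \frac{1}{L} \sum_{l=(k-1) \cdot L+1}^{k \cdot L} N_{n,l} \qquad (2)$$

where $\hat{N}_{n,k}$ is averaged on the basis of the last L values of the spectral power $N_n$.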

The associated vector $V_k^n$ serves as a key vector and is exploited during the noise reduction mode when a spectrum $\hat{N}_j$ is searched for. This vector should be unique; in practice, however, this condition is hard to ensure. Its elements are expected to reflect quantitatively the noisy character of the average spectrum $\hat{N}_k$. Therefore two kinds of parameters are considered that turned out to be very robust in contemporary perceptual coding schemes [8]: the Spectral Flatness Measure [5] and the unpredictability measure [1]. These parameters are computed in each critical band, and their definitions for the l-th frame are given below.

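Application of the Spectral Flatness Measure. The SFM parameter is defined as the ratio of the geometric mean to the arithmetic mean of the power spectrum [5], and is given in dB. In the b-th subband, the parameter can be redefined as follows:

$$SFM_b^{(l)} = 10 \log_{10} \left( G_b^{(l)} / A_b^{(l)} \right), \qquad (3)$$

where $G_b^{(l)}$ and $A_b^{(l)}$ are the geometric and arithmetic means of the power spectrum in the b-th subband of the l-th frame. Hence, the vector $V_k^n$ can be described in the following way:

$$V_k^n = [SFM_{1,k} \ \ldots \ SFM_{b,k} \ \ldots \ SFM_{B,k}]^T \quad \text{and} \quad SFM_{b,k} = \frac{1}{L} \sum_{l=(k-1) \cdot L+1}^{k \cdot L} SFM_b^{(l)}. \qquad (4)$$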
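The per-band SFM and its averaging over L frames can be sketched in Python as follows. This is a minimal illustration, assuming strictly positive power values and bands given as hypothetical (first bin, last bin) index pairs; the function names are not taken from the paper.

```python
import math


def sfm_db(power):
    """10*log10(geometric mean / arithmetic mean) of positive power values."""
    geo = math.exp(sum(math.log(p) for p in power) / len(power))
    arith = sum(power) / len(power)
    return 10.0 * math.log10(geo / arith)


def band_sfm(power_spectrum, bands):
    """SFM of one frame in each band; bands are inclusive (lo, hi) bin indices."""
    return [sfm_db(power_spectrum[lo:hi + 1]) for lo, hi in bands]


def averaged_sfm(frames, bands):
    """Key-vector elements: per-band SFM averaged over the given frames."""
    per_frame = [band_sfm(f, bands) for f in frames]
    return [sum(col) / len(per_frame) for col in zip(*per_frame)]


# Two hypothetical frames with two bands of two bins each.
key_vector = averaged_sfm([[1.0, 4.0, 2.0, 2.0], [2.0, 2.0, 1.0, 4.0]], [(0, 1), (2, 3)])
```

A perfectly flat band gives 0 dB, and the value becomes more negative as the band becomes more tonal, which is why the measure is a useful indicator of noise-like content.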


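Application of the Unpredictability Measure. Let $\hat{r}_n^{(l)}$ and $\hat{\phi}_n^{(l)}$ denote the spectral magnitude prediction and the phase prediction of the n-th spectral component, each computed on the basis of its last two real values:

[equation (5): definitions of the magnitude and phase predictions, not recoverable from the source]

The unpredictability measure $c_n^{(l)}$ is defined as the Euclidean distance between the real values $(r_n^{(l)}, \phi_n^{(l)})$ and the predicted ones $(\hat{r}_n^{(l)}, \hat{\phi}_n^{(l)})$, as in the formula [1]:

[equation (6): not recoverable from the source]

In such a case, each element of the vector $V_k^n = [C_{1,k} \ \ldots \ C_{b,k} \ \ldots \ C_{B,k}]^T$ is averaged over the last L frames and within the b-th critical band in the following way:

$$C_{b,k} = \frac{1}{L} \sum_{l=(k-1) \cdot L+1}^{k \cdot L} c_b^{(l)} \quad \text{where:} \quad c_b^{(l)} = \frac{1}{count(b)} \sum_{n=lower(b)}^{upper(b)} c_n^{(l)} \qquad (7)$$

where lower(b) and upper(b) denote the indexes of the first and the last spectral bin in the b-th subband, which contains count(b) components.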
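For orientation, the unpredictability measure can be sketched in Python as below. The sketch assumes the form commonly used in perceptual audio coders (linear extrapolation of magnitude and phase from the two preceding frames, and a distance normalised by the sum of actual and predicted magnitudes); whether the paper uses exactly this normalisation cannot be confirmed from the text, and all names are hypothetical.

```python
import math


def unpredictability(r, phi, r1, phi1, r2, phi2):
    """Unpredictability of one bin from its current and two previous values.

    (r, phi) is the current magnitude and phase, (r1, phi1) the previous
    frame and (r2, phi2) the frame before that. Assumed standard form.
    """
    r_hat = 2.0 * r1 - r2          # linear extrapolation of magnitude
    phi_hat = 2.0 * phi1 - phi2    # linear extrapolation of phase
    dist = math.hypot(r * math.cos(phi) - r_hat * math.cos(phi_hat),
                      r * math.sin(phi) - r_hat * math.sin(phi_hat))
    denom = r + abs(r_hat)
    return dist / denom if denom > 0 else 0.0


def band_average(c, lower, upper):
    """Mean of per-bin values c over the inclusive bin range lower..upper."""
    return sum(c[lower:upper + 1]) / (upper - lower + 1)
```

A component that follows its linear trend exactly yields a value near 0 (tone-like), while a component whose phase flips unexpectedly yields a value near 1 (noise-like).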

4.2 Decision Systems

The Decision Systems module (Fig. 3) is fed by the spectral representation of the noisy signal $Y(j\omega)$. First, the input signal is processed in the Extraction of Parameters block, whose task is to obtain a vector of parameters $V^y$; these parameters are expected to be closely related to the noisy character of the input $Y(j\omega)$. Therefore the elements of $V^y$ are defined by analogy to the elements of the key vector $V_k^n$, and computed as in formula (4) or (7). The vector $V^y$ is next supplied to the Decision System I, whose objective is to give the index value of the noise spectrum $\hat{N}_j$ that should be most correlated with the noise present in the noisy signal $Y(j\omega)$.

Having been received from the Table of Vectors, the desired vector $\hat{N}_j$ is compared with the spectral representation $Y(j\omega)$ in the Decision System II, which produces two output sets: the set U of useful and the set D of useless components.


Fig.  3.  Scheme  of  the  Decision  Module 


Implementation of the Decision System I. Decision-making in the Decision System I can be based on rough sets or neural reasoning, and the run of the system can be divided into two modes: the training mode and the execution mode. In the first case, the content of the Table of Vectors is exploited, which is depicted in Fig. 3 with dashed lines. The intelligent decision algorithms are considered further in this section.


Application of Rough Sets. In the training mode, related to rule discovery, a part of the Table of Vectors is treated as a decision table, where elements of the key vector $V_k^n$ defined by formula (4) or (7) are conditional attributes and the vector's index in the Table of Vectors is a decision attribute. Therefore the k-th object in the table is described by the following relationship:

$$SFM_{1,k} \wedge \ldots \wedge SFM_{B,k} \Rightarrow k \quad \text{or:} \quad C_{1,k} \wedge \ldots \wedge C_{B,k} \Rightarrow k \qquad (8)$$

where the variables SFM and C are given as in formulae (4) and (7), respectively.

The rule discovery procedure is based on the rough set principles [7] and algorithm [2]. It can be noticed that only conditional attributes require quantization. So far, only uniform quantization has been used. In the execution mode, the input vector of noisy audio parameters is quantized, and next processed by the set of generated rules. As a result, the index value of the noise spectrum $\hat{N}_j$ in the Table of Vectors is obtained.
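The role of uniform quantization in the decision table can be illustrated with the Python sketch below. It is deliberately simplified: instead of the rule induction algorithm of [2], it assumes a plain exact-match lookup on the quantized conditional attributes, and the function names and sample values are hypothetical.

```python
def quantize(vector, step):
    """Uniformly quantize each attribute to an integer multiple of step."""
    return tuple(round(v / step) for v in vector)


def build_table(key_vectors, step):
    """Map each quantized key vector to its 1-based index k (first one wins)."""
    table = {}
    for k, vec in enumerate(key_vectors, start=1):
        table.setdefault(quantize(vec, step), k)
    return table


def decide(table, vector, step):
    """Return the index of the matching key vector, or None if no rule fires."""
    return table.get(quantize(vector, step))


# Two hypothetical key vectors (e.g. SFM values in dB) and a quantization step of 0.5.
table = build_table([[-3.2, -1.1], [-7.9, -5.2]], step=0.5)
index = decide(table, [-3.1, -1.0], step=0.5)
```

The sketch also shows why the quantization step matters: too fine a step leaves most noisy inputs without a matching object, while too coarse a step merges distinct key vectors into one.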

Application of Neural Networks. The applied neural network is a classic feedforward structure with hidden neurons [10]. In the preliminary experiments, only one hidden layer was considered. The number of input units is equal to the number of elements in the vector $V_k^n$, given by eq. (4) or (7). There is only one output neuron, which produces the entry value (index value) to the Table of Vectors. For all input and hidden units, the activation function is sigmoidal. However, since the index value is an integer, this function for the output unit is considered as sigmoidal or linear.

In the training mode, only a part of the Table of Vectors is used in the training set: the key vector $V_k^n$ is an input vector, whereas its index k in the table is the desired output associated with this vector. Thus, the training set is as follows:

$$\{\ldots, (V_k^n, k), \ldots\} \qquad (9)$$

As a training method, the standard Error Backpropagation Training Algorithm [10] is applied, and the error measure is based on the mean squared error. If the activation function of the output neuron is sigmoidal, the desired index value should be scaled down in order to match the interval [0,1].

In the execution mode, the network's output (related to the index value) must be rounded to the nearest integer. Additionally, if the output unit uses a sigmoidal function, before the rounding the neuron's output must be scaled up by the same factor by which the desired values were scaled down.
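The scaling and rounding of the index value can be sketched in Python as follows. The scale factor is an assumption: the text does not state it, so the sketch divides by the number of vectors in the table, and the clamping to a valid index is an added safeguard. The function names are hypothetical.

```python
def encode_index(k, num_vectors):
    """Scale a 1-based table index into (0, 1] for a sigmoidal output unit."""
    return k / num_vectors


def decode_index(output, num_vectors, sigmoidal=True):
    """Turn the network output back into a valid table index."""
    value = output * num_vectors if sigmoidal else output
    k = int(value + 0.5)                  # round to the nearest integer
    return min(max(k, 1), num_vectors)    # clamp to the range of the table
```

For example, with 22 vectors in the table a sigmoidal output of 0.3 decodes to index 7, and a linear output of 25.2 is clamped to the last index 22.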

Implementation of the Decision System II. In the Decision System II, the division into useful and useless elements is executed according to the following simple procedure. All components whose spectral powers $Y_n$ are at least double the averaged value of the representative noise estimation $\hat{N}_j$ are assumed to be the useful elements. In turn, the remaining components are regarded as useless ones. Hence, in the case of use of the N-point DFT, the sets U and D can be defined as follows:

$$U = \{Y_n : Y_n \geq 2 \cdot \hat{N}_{n,j} \ \text{and} \ n = 1, \ldots, N/2\}$$

$$D = \{Y_n : Y_n < 2 \cdot \hat{N}_{n,j} \ \text{and} \ n = 1, \ldots, N/2\} \qquad (10)$$

In general, there can be various methods of such a division [2], and the choice of one of them has a significant influence on the subjective quality of the restored audio.
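The division into useful and useless components amounts to a per-bin threshold test, sketched in Python below. The function name and the sample values are hypothetical, and bins are numbered from 1 to match the indexing n = 1, ..., N/2 of the text.

```python
def split_components(Y, N_hat):
    """Split bin indices into useful (U) and useless (D) sets.

    Y holds the spectral powers of the noisy frame and N_hat the selected
    noise spectrum, both over the same bins.
    """
    useful, useless = [], []
    for n, (y, noise) in enumerate(zip(Y, N_hat), start=1):
        (useful if y >= 2.0 * noise else useless).append(n)
    return useful, useless


U, D = split_components([5.0, 1.0, 4.0, 0.5], [2.0, 1.0, 2.0, 1.0])
```

Every bin lands in exactly one of the two lists, so the sets are disjoint and together cover the whole analysed spectrum.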


4.3  Perceptual  Noise  Reduction  Module 

The task of the module is to process the spectral representation of the noisy signal as follows. All useful spectral components are reduced according to the spectral subtraction principles [9], whereas the remaining useless components are masked using the psychoacoustic approach. This perceptual approach is a separate complex issue which is not related to the application of intelligent tools, and due to space limitations it is not described in this paper. However, the applied perceptual models and engineered methods, together with the appropriate algorithms, were presented in detail in the recent publications of the authors [4] [6].


5  Experiments 

There were two objectives of the experiments: verification of the engineered method for non-stationary noise reduction and comparison of different decision algorithms. Some verification tests were carried out first, in order to check whether the application of intelligent tools could improve the quality of restored audio signals. The results were encouraging enough for the subsequent comparison experiments to be prepared.

In order to assess the quality of a decision algorithm, the result of such an algorithm should be compared with the desired output. In this research, it was necessary to know whether the noise spectrum pointed to by the decision system was the best choice, and if not, an error measure was needed.

For the purpose of the comparison experiments, two recordings were made: a male voice (5.81 s) and a non-stationary noise (2.79 s) taken from a radio channel. Next, the original audio was corrupted by the additive noise, and at the same time, elements of the key vectors of the noise were computed and collected. Since the original audio and the noise were given, it was known which part of the noisy voice was described by which key vector and noise spectrum vector. The parameters of these recordings were as follows. Both were mono, sampled with 16-bit resolution at a frequency of 8192 Hz, which resulted in B = 18 critical bands. During the audio processing, the signals were divided into overlapping frames. Since the Hamming window function was used, the overlap size was half of the frame length N. In the experiments three values of N are used: 128, 256 and 512, and their influence on the time and frequency resolution is shown in Tab. 1. Additionally, Tab. 1 gives the number of the noise key vectors, assuming that the signals are averaged over the last L = 4 frames. It can be noticed that the number of these vectors is also the number of objects in a decision table or the number of training vectors.

Table 1. Influence of frame size N on time and frequency resolution and training set contents

N     Time resolution   Frequency resolution   Number of vectors
128   7.83 ms           64 Hz                  88
256   15.63 ms          32 Hz                  44
512   31.25 ms          16 Hz                  22
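The resolution columns of Tab. 1 can be checked with a short Python computation. The sketch assumes that the time resolution is the hop between half-overlapped frames and that the frequency resolution is the DFT bin spacing; the function name is hypothetical.

```python
def frame_resolution(frame_len, sample_rate=8192):
    """Return (hop in ms, bin spacing in Hz) for half-overlapped frames."""
    hop_ms = 1000.0 * (frame_len / 2) / sample_rate
    freq_hz = sample_rate / frame_len
    return hop_ms, freq_hz


resolutions = {n: frame_resolution(n) for n in (128, 256, 512)}
```

Under these assumptions the frequency resolutions match the table exactly, and the hop times (7.8125 ms, 15.625 ms and 31.25 ms) are close to the tabulated time resolutions.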

The comparison tests were divided with respect to the following variables:

•  various frame sizes: N = 128, N = 256, N = 512

•  various key vector types: based on the SFM parameters (4) and the unpredictability measure (7)

•  various quantization steps: 0.1, 0.5, 1

•  various numbers of hidden neurons: 10, 15, 20

•  various activation functions of the last unit: sigmoidal, linear

Hence, a single test attempt can be described by a set of parameters which are valid for a given decision algorithm. Thus, the following denotations are proposed: (N, vector, RS, quantization step) for the application of rough sets, and (N, vector, NN, hidden neurons, output unit) for the application of neural nets. In total, there are 32 tests, such as the following exemplary ones: (512, SFM, NN, 10, sigmoidal), (512, C, NN, 10, sigmoidal), (512, SFM, RS, 0.5).

In order to assess the quality of a decision system, the error measure E is introduced. This measure is the average value of the errors $E_i$ produced for all I frames of the noisy signal, and is defined according to the following expression:

$$E = \frac{1}{I} \sum_{i=1}^{I} E_i \qquad (11)$$

where the per-frame error $E_i$ is accumulated over the bands b = 1, ..., B from the vector of parameters of the i-th frame of the noisy audio and the b-th elements of the key vector placed at position index in the Table of Vectors (the exact expression for $E_i$ is not recoverable from the source). By analogy, the optimal error measure $E_{opt}$ and the maximum error $E_{max}$ can be introduced. Assuming that the i-th frame of the audio is corrupted by the noise described by the j-th vector in the Table of Vectors, these measures are defined as below:

$$E_{opt} = \frac{1}{I} \sum_{i=1}^{I} E_i^{opt} \qquad (12)$$

$$E_{max} = \frac{1}{I} \sum_{i=1}^{I} E_i^{max} \qquad (13)$$

where $E_i^{opt}$ is the per-frame error obtained for the j-th vector, $E_i^{max}$ is the maximum of the per-frame error over all vectors in the table, and K is the number of vectors in the Table of Vectors. Finally, the quality coefficient q is introduced as follows:

$$q = 1 - \frac{E - E_{opt}}{E_{max} - E_{opt}} \qquad (14)$$

For the completed surveys this quality coefficient is as follows:

•  (512, SFM, NN, 10, sigmoidal): q = 81.38 %

•  (512, C, NN, 10, sigmoidal): q = 85.21 %

•  (512, SFM, RS, 0.5): q = 78.43 %
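The quality coefficient rescales the measured error between the best and the worst achievable values, as the following Python sketch shows. The function name and the sample numbers are hypothetical; only the formula itself is taken from the definition of q.

```python
def quality(E, E_opt, E_max):
    """Quality coefficient: 1 when E equals E_opt, 0 when E equals E_max."""
    return 1.0 - (E - E_opt) / (E_max - E_opt)


best = quality(2.0, 2.0, 10.0)
worst = quality(10.0, 2.0, 10.0)
halfway = quality(6.0, 2.0, 10.0)
```

Read this way, a reported q of about 85 % means that the decision system removed roughly 85 % of the gap between the worst possible choice of noise spectrum and the optimal one.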




It can be assumed that the best result, obtained for the test (512, C, NN, 10, sigmoidal), is due to the use of the more precise key vector type (based on the unpredictability measure). In turn, the worst result, obtained with rough sets, can be caused mainly by an inefficient quantization step.


6  Conclusions 

In the paper, the engineered system for non-stationary noise reduction has been presented, which exploits reasoning based on neural processing and rough sets. A number of experiments have been carried out in order to assess the quality of particular decision systems. The results of experiments show that computational intelligence and soft computing algorithms are applicable to the control of perceptual coding algorithms applied to noise reduction.


References 

1. Brandenburg, K.: Second Generation Perceptual Audio Coding: The Hybrid Coder. Proc. of the 90th Audio Eng. Soc. Conv., Monteux, France (1990) preprint 2937

2. Czyzewski, A.: Learning Algorithms for Audio Signal Enhancement. J. of Audio Eng. Soc. 45 (1997) 931-943

3. Czyzewski, A., Krolikowski, R.: Application of Intelligent Decision Systems to the Perceptual Noise Reduction of Audio Signals. Proc. of the 5th European Congress on Intelligent Techniques and Soft Computing, Aachen, Germany 1 (1997) 188-192

4. Czyzewski, A., Krolikowski, R.: Noise Reduction in Audio Employing Auditory Masking Approach. Proc. of 106th Audio Eng. Soc. Conv., Munich, Germany (1999) preprint 4930

5. Johnston, J.: Transform Coding of Audio Signals Using Perceptual Noise Criteria. IEEE J. of Select. Areas in Communication 6 (1988) 314-323

6. Krolikowski, R., Czyzewski, A.: Noise Reduction in Acoustic Signals Using the Perceptual Coding. Proceedings of the 137th Meeting Acoustic Society of America, Berlin, Germany (1999) CD-Preprint

7. Pawlak, Z.: Rough Sets. International Journal of Computer and Information Sciences 11 (1982) 341-356

8. Shlien, S.: Guide to MPEG-1 Audio Standard. IEEE Trans. Broadcasting 40 (1994) 206-218

9. Vaseghi, S.: Advanced Signal Processing and Digital Noise Reduction. Wiley & Teubner, New York (1997)

10. Zurada, J.: Introduction to Artificial Neural Networks. West Publ. Comp., St. Paul (1992)

11. Zwicker, E., Fastl, H.: Psychoacoustics, Facts and Models. Springer Verlag, Berlin (1990)


Acknowledgements 

The research is sponsored by the State Committee for Scientific Research, Warsaw,
Poland, Grant No. 8 T11D 021 12.


Rough  Set  Analysis  of  Electrostimulation  Test  Database 
for  the  Prediction  of  Post-Operative  Profits  in  Cochlear 
Implanted  Patients 


Andrzej Czyzewski, Henryk Skarzynski, Bozena Kostek and Rafal Krolikowski

Institute of Physiology and Pathology of Hearing,
ul. Pstrowskiego 1, 00-943 Warsaw, Poland

kid@sound.eti.pg.gda.pl 


Abstract. A new method of examining the hearing nerve in deaf people has
been developed at the Institute of Physiology and Pathology of Hearing in
Warsaw. It consists in testing deaf people with a speech signal delivered through a
ball-shaped microelectrode connected to a modulated current source and
attached to the promontory area. The electric current delivered to the ball-shaped
electrode is modulated with a real speech signal which is transposed downwards
on the frequency scale. A computer database of patients' data and
electrostimulation test results has been created. This database was analyzed using the
rough set method in order to find rules allowing prediction of hearing recovery of
cochlear implantation candidates. The Rough Set Class Library (RSCL) has
been developed in order to apply data mining procedures to the engineered
database of electrostimulation test results. The RSC Library supports a
symbolic approach to data processing. Additionally, the library is equipped
with a set of data quantization methods that may form part of an interface
between the external data environment and the rough set-based kernel of the system.
The results of studies on the prediction of post-operative profits of
deaf patients based on the rough set analysis of the electrostimulation test
database are presented and discussed in the paper.


1  Introduction 

The electrostimulation tests are treated as an important tool in preoperative
diagnostics of deaf people who are candidates for cochlear implantation. The idea of
electrical stimulation, which goes back to A. Volta, is to electrically stimulate
terminations of the fibers of the auditory nerve in order to evoke an auditory sensation in
the central nervous system. The electrical stimulation is the application of electrical
current to the audiovestibular nerve in order to assess its integrity [1]. It can be
performed using either an invasive or a non-invasive technique. In the invasive technique
a promontory needle electrode is applied transtympanically [3], [8] or a ball-shaped
electrode is placed in the round window niche following a transcanal tympanomeatal
incision and removal of the bony overhang overlying the round window membrane
[4]. A non-invasive alternative is the extratympanic ear canal test [6], [7], [9]. During this
test a metal electrode is inserted into the ear canal, which has been filled with
physiological saline solution [2]. It is assumed that the non-invasive attachment
considerably simplifies preoperative electrostimulation. This is especially important
when testing children [5].

The results of electrical stimulation tests are highly diversified, depending on a
patient's age and condition. There was therefore a need to create a database provided with
adequate tools for finding and studying the correlation between the obtained test
results and patients' ability to receive and understand sounds. However, in order to
achieve the best perspective in cochlear implantation it is necessary not only to
diagnose the auditory nerve status properly, but at the same time to evaluate the future
benefits of the cochlear implant to the patient. The procedure developed at the
Institute of Physiology and Pathology of Hearing in Warsaw allows determining some
vital characteristics of the hearing sense that help to make decisions regarding
cochlear implantation [11]. The testing, based on electrical stimulation via the
external auditory canal filled with saline, is performed using the ball-shaped electrode
and the spectral transposition of the speech signal. Moreover, patients' data such as
personal data and health history are included in the database. The rough set
algorithm was engineered to enable an analysis of highly diversified database records in
order to find dependencies between data and to make it possible to predict the results
of cochlear implantation on the basis of the results obtained previously in other patients.


2  Method  of  Speech  Processing 

The method of spectral transposition of the speech signal was engineered earlier for
use in special hearing aids applied in cases of profound hearing loss [10]. Some essential
modifications were introduced to the previously designed algorithm in order to adjust
it to the current needs.

The simplified scheme of the algorithm of spectral transposition of speech signals
is shown in Fig. 1. It can be noticed that the structure of the algorithm is based on
the voice coder (vocoder). The natural speech signal is delivered to the transposer.
Since the energy of sounds decreases towards higher frequencies, the signal is
preemphasized by 6 dB/oct. By analogy, the transposed signal is deemphasized at the
output by the same ratio, i.e. 6 dB/oct. Additionally, in order to get rid of some
disturbances that may occur while manipulating the signal, a low-pass filter is applied.
The deemphasized signal is compressed in the Compressor module, because of the
serious limitation of the dynamic range of signals received by the electrically
stimulated auditory nerve.
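
The pre-emphasis and de-emphasis stages can be illustrated with a pair of first-order filters, which approximate a 6 dB/oct slope over most of the band. This is only a minimal sketch and not the authors' implementation: the coefficient `alpha = 0.95` and the function names are assumptions introduced here for illustration.

```python
def preemphasize(samples, alpha=0.95):
    """First-order high-pass: y[n] = x[n] - alpha * x[n-1] (roughly +6 dB/oct)."""
    out = []
    previous_input = 0.0
    for x in samples:
        out.append(x - alpha * previous_input)
        previous_input = x
    return out


def deemphasize(samples, alpha=0.95):
    """Inverse first-order low-pass: x[n] = y[n] + alpha * x[n-1] (roughly -6 dB/oct)."""
    out = []
    previous_output = 0.0
    for y in samples:
        previous_output = y + alpha * previous_output
        out.append(previous_output)
    return out
```

Applied back to back, the two filters cancel each other. In the described system the transposition takes place between them, so the round trip only demonstrates that the output slope compensates the input slope.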

The detection of voiced speech portions is based on the cepstrum analysis
method. When voiced sounds are pronounced, a Fundamental Frequency Generator
is activated. In such a case, the synthesized sound is a result of a convolution of a
periodic stimulus and the impulse response of a vocal tract represented by spectral
envelopes. The detected vocal tone frequency is then divided by a factor selected
from the range of 1.5 to 3, depending on the width of the patient's auditory nerve
response frequency band. The resynthesis of speech with the lower vocal tone
frequency makes it possible to maintain speech formants in the low frequency band related to
the patient's auditory nerve sensitivity characteristics. The above procedures make it
possible to resynthesize speech in the lower band of the frequency scale in such a way
that the formant frequency ratios are preserved. This helps to keep the synthetic
speech intelligible, though it is received within a narrow frequency band only.
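
The arithmetic behind the preserved frequency ratios can be shown with a short sketch. It assumes that every frequency component is scaled by the same factor. The real system performs the transposition with vocoder filter banks, so the function below is a hypothetical illustration of the arithmetic only, and the formant values used with it are invented.

```python
def transpose_frequencies(frequencies_hz, factor):
    """Divide every frequency by a common factor taken from the 1.5-3 range."""
    if not 1.5 <= factor <= 3.0:
        raise ValueError("factor is outside the 1.5-3 range quoted in the text")
    return [f / factor for f in frequencies_hz]
```

Because all components are divided by the same number, the ratio between any two of them is unchanged: two components at 700 Hz and 1400 Hz move to 350 Hz and 700 Hz for a factor of 2, and their ratio remains 2.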


[Figure 1: block diagram. Legible labels: Input; bank of 10 bandpass filters related to
critical bands; bank of 10 bandpass filters related to the auditory frequency response
(programmable filters)]

Fig. 1. General lay-out of the noise reduction system


3  Examination  of  the  Patients 

A set of charts is prepared in order to facilitate co-operation with patients. The charts
include responses, such as the auditory sensation received by the patient: soft,
comfortable, loud, very loud, etc. The charts also include information on the type of received
signal. During the examination, three standard tests are performed:

•  TMU - dynamics range defined by the auditory threshold (THR) and uncomfortable
loudness

•  TDL - Time Difference Limen test

•  TID - test of frequency differentiation

In the TMU test, the values of intensity in [µA] of the electrical stimuli evoking
an auditory response are determined. In the TDL test, the duration of the
stimulus is varied. The first step is to determine the level of comfortable
hearing (MCL) for the patient for a stimulus of 125 Hz (or 62.5 Hz). If the dynamics
in this range of frequencies is as high as 10 dB, this result is recognized as good.
Next, the patient listens to three sounds, of which one is longer and two are shorter.
The purpose of this test is to find whether the patient is capable of differentiating the
sequence in which the sounds are given. The difference in the duration of the long
and short sound changes, depending on the patient's response. The result of this test
is given in milliseconds [ms].


If the result obtained by the patient is less than 120 ms, this means that the patient
can recognize time relations well. It gives a good perspective for the patient's
achieving speech comprehension in an acoustic manner. In the next test (TID) the
frequency of the stimulus is varied. This test is done at three frequencies:
31.25 Hz, 62.5 Hz and 125 Hz. For these frequencies the level of the most comfortable
hearing is determined. Three different sounds are demonstrated to the patient: I, II and
III, corresponding to the above frequencies. Next, after a sound is given, the patient
tells which sound he or she has heard: I, II or III. To come up with a statistically
valid and credible result, the examination has to be repeated at least three times.

The fourth measurement is the difference between ULL, which is the level of
feeling discomfort or the level of feeling pain, and THR, which is the lowest
threshold of the stimulus that can be heard. This difference indicates the range of
dynamics defined in the decibel scale [dB]. It is a very important measurement because
it defines the ability of the auditory nerve to be electrically excited and gives a
prognosis as to the postoperative effects. According to results presented in the literature,
dynamics exceeding 13 dB guarantee good results of postoperative rehabilitation.
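
The two numerical criteria quoted in this section (a TDL result below 120 ms and dynamics exceeding 13 dB) can be sketched as a small screening helper. The conversion of the current levels in µA to decibels as 20·log10(ULL/THR) is an assumption made here, since the text does not state the formula; the function and key names are also hypothetical.

```python
import math


def dynamic_range_db(thr_ua, ull_ua):
    """Dynamic range between THR and ULL (both in microamperes), in dB.

    Assumes the usual amplitude-ratio convention 20*log10(ULL/THR).
    """
    if thr_ua <= 0 or ull_ua <= 0:
        raise ValueError("current levels must be positive")
    return 20.0 * math.log10(ull_ua / thr_ua)


def screen_results(thr_ua, ull_ua, tdl_ms):
    """Apply the thresholds quoted in the text to one patient's results."""
    dynamics = dynamic_range_db(thr_ua, ull_ua)
    return {
        "dynamic_range_db": dynamics,
        "dynamics_good": dynamics > 13.0,  # "exceeding 13 dB"
        "tdl_good": tdl_ms < 120.0,        # "less than 120 ms"
    }
```

For example, a threshold of 100 µA and an uncomfortable level of 1000 µA correspond to a tenfold current ratio, i.e. 20 dB under the stated assumption.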


4  Database  of  Electrostimulation  Test  Results 

In order to evaluate the results obtained in preoperative electrostimulation tests, a set
of examining techniques is used after the cochlear implantation. These are:
screening tests and auditory speech perception, recognition and identification tests, the
latter using various speech elements, such as single words, vowels,
monosyllables, onomatopoeias, etc. The aim was to establish a correlation between
preoperative and postoperative test results. For this purpose the mentioned database,
containing results obtained by more than 150 implanted patients, has been created at
the Institute of Physiology and Pathology of Hearing in Warsaw. It also includes
personal data and some additional factors pertaining to educational and social skills, the
degree of motivation, how early a hearing aid providing constant
acoustic stimulation of the auditory system was prescribed, etc.

The created database has been designed for testing by techniques recognized as
data mining or soft computing. These techniques are very valuable in clinical
diagnostics because they can trace some hidden relations between data, not visible in the
case when patients' data are not complete or there are many records included in the
database. The database addresses the following issues: a) personal data; b) cause of
deafness; c) kind of deafness (prelingual, perilingual, postlingual); d) time passed
since deafness; e) dates of examinations; f) number of tests performed; g) found
dynamic range of hearing nerve sensitivity [dB]; h) found frequency band of hearing
nerve sensitivity [Hz]; i) TMU measurement; j) TDL test result; k) TID test result; l)
some factors which may influence the results of electrostimulation tests (e.g.:
progressive hearing loss, acoustic trauma, use of hearing aids, ...); m) use of
transposition during the voice communication; n) patient's motivation; o) patient's social skills;
p) vowels recognition ability; q) monosyllable recognition ability; r) onomatopoeias
recognition ability; s) simple commands recognition ability.

The patients are divided into two age ranges: 0-18 years and more than 18 years.
Five groups of patients are distinguished with regard to the time passed since
deafness: less than 1 year; 1-5 years; 5-10 years; 10-20 years; more than 20 years.

As is easy to observe in the above list, the database contains highly diversified
information, namely text strings (a-b), integers (c-h), real numbers (i-k), binary flags
(l-m), and grades (n-s). The processing of such information can be done efficiently
by an algorithm based on the rough set method. The results of measurements (g-k)
need to be quantized automatically by some adequate algorithms.
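
The grouping of patients by age and by time passed since deafness is a simple interval quantization and can be sketched as follows. The group boundaries are taken from the text, but the text does not say to which group a boundary value (e.g. exactly 5 years) belongs; assigning it to the upper group is an assumption, as are the names used.

```python
from bisect import bisect_right

# Boundaries (in years) of the five groups listed in the text.
DURATION_EDGES = [1, 5, 10, 20]
DURATION_LABELS = ["<1", "1-5", "5-10", "10-20", ">20"]


def duration_group(years_since_deafness):
    """Map the time passed since deafness to one of the five groups."""
    if years_since_deafness < 0:
        raise ValueError("duration cannot be negative")
    # Assumption: a value lying exactly on an edge falls into the upper group.
    return DURATION_LABELS[bisect_right(DURATION_EDGES, years_since_deafness)]


def age_group(age_years):
    """Map the patient's age to one of the two age ranges."""
    return "0-18" if age_years <= 18 else ">18"
```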


5  Rough  Set  Analysis  of  Electrostimulation  Database 

The library of rough set procedures was engineered at the Sound Engineering
Department of the Technical University of Gdansk. This library makes it possible to
include some rough set data analysis procedures in the engineered database software. A
description of the rough set class library is given below in this section.


5.1  Rough  Set  Class  Library 

The Rough Set Class Library is an object-oriented library of procedures/functions
whose goal is to process data according to the principles of rough set theory. The
RSCL takes all necessary actions related to data mining and knowledge discovery.
The engineered library is designed to run in the DOS/Windows environment and is
compiled with the Borland C++ Compiler v. 3.1.

In general, the functions implemented in RSCL comprise the following tasks:

•  rule  induction 

•  processing  of  rules 

•  fundamental  operations  on  rough  sets:  partition  of  the  universe  into  equivalence 
classes  (i.e.  sets  of  objects  indiscernible  with  respect  to  a  given  set),  calculation  of 
lower  and  upper  approximation,  calculation  of  boundary  region,  calculation  of 
positive  and  negative  region 

•  supply  of  auxiliary  functions:  showing  the  winning  rule  and  its  rough  measure, 
calculating  number/percentage  of  certain  and  uncertain  rules,  showing  range  of 
the  rough  measure  for  all  rules,  computing  cardinality  of  data  sets. 

The kernel of the rough-set-based data processing system works on symbolic data,
i.e. symbolic representations of attributes and decisions in an information table. In
order to facilitate fitting the system's kernel to external data, which are most frequently
non-symbolic, some methods of quantization are supplied in the library [12]. These
discretization procedures are as follows: Equal Interval Width Method, Statistical
Clustering Method and Maximum Distance Method.
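
Of the three discretization procedures named above, the Equal Interval Width Method is the simplest to illustrate. The sketch below is not the RSCL code (which is a C++ library not reproduced in the paper); the function name and the choice of returning integer bin indices are assumptions.

```python
def equal_width_bins(values, n_bins):
    """Assign each value the index (0 .. n_bins-1) of its equal-width interval."""
    if n_bins < 1:
        raise ValueError("n_bins must be at least 1")
    lowest, highest = min(values), max(values)
    if highest == lowest:
        # All values identical: everything falls into the first interval.
        return [0 for _ in values]
    width = (highest - lowest) / n_bins
    # The maximum value would land on the upper edge, so clamp it to the last bin.
    return [min(int((v - lowest) / width), n_bins - 1) for v in values]
```

The resulting indices can then be mapped to symbols such as "low" or "high" before the data are passed to a symbolic rough set kernel.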


5.2  The Rough Set Class Implementation

This section gives a brief description of some RSCL functions that are related
to the principles of rough-set-based data processing.

•  quantization of data in the decision table

The function performs discretization of data values in the decision table
according to a given quantization method.

Input:  a  decision  table,  a  quantization  method. 

Output:  the  decision  table  with  quantized  values. 

•  quantization  of  conditional  attributes 

The  function  performs  discretization  of  values  of  the  conditional  attributes  ac¬ 
cording  to  a  given  quantization  method. 

Input:  a  set  of  conditional  attributes,  a  quantization  method. 

Output:  the  set  of  conditional  attributes  in  an  adequate  symbolic  representation. 

•  dequantization  of  decision  attributes 

The function executes the inverse action to the quantization algorithm: it replaces
the symbolic value of data with adequate crisp values.

Input:  a  symbolic  decision  (decision  attributes),  a  quantization  method. 

Output:  crisp  values  of  the  decision. 

•  rule  induction 

The  function  induces  a  rule  base.  The  action  is  performed  on  the  basis  of  the 
principles  of  the  rough  set  method. 

Input:  a  decision  table. 

Output:  a  table  with  discovered  knowledge  (induced  rules). 

•  processing  of  rules 

The  function  deduces  a  decision  on  the  basis  of  an  event,  i.e.  set  of  conditional 
attributes.  The  rule  base  has  to  be  induced  earlier. 

Input:  an  event  (set  of  conditional  attributes). 

Output:  a  decision  (set  of  decision  attributes). 

•  partition  of  the  universe  into  equivalence  classes 

The  procedure  yields  a  set  of  equivalence  classes.  The  decision  table  is  parti¬ 
tioned  into  some  sets  with  respect  to  a  certain  set  of  attributes. 

Input:  a  decision  table  (the  universe),  a  set  of  conditional  attributes. 

Output:  a  set  of  equivalence  classes. 

•  calculation  of  lower-  and  upper  approximation  and  boundary  region 

The  functions  compute  sets,  which  are  lower-  and  upper  approximations  and  a 
boundary  region  of  a  set  of  objects  with  respect  to  a  certain  set  of  attributes. 

Input:  a  set  of  objects  (in  the  form  of  a  decision  table),  a  set  of  attributes. 

Output:  a  resultant  set  of  objects. 

•  calculation  of  positive  and  negative  region 

The  functions  calculate  positive  or  negative  regions  of  classification  for  a  certain 
set  of  attributes. 

Input:  a  decision  table,  two  sets  of  attributes. 

Output:  a  resultant  set  of  objects. 
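
The set-oriented functions listed above (partition into equivalence classes, lower and upper approximation, boundary region) can be sketched in a few lines. This is an illustrative Python sketch under assumed names and data structures, not the RSCL interface, which the paper describes only in terms of inputs and outputs.

```python
from collections import defaultdict


def equivalence_classes(table, attributes):
    """Partition the object ids of `table` by their values on `attributes`."""
    classes = defaultdict(set)
    for object_id, row in table.items():
        key = tuple(row[a] for a in attributes)
        classes[key].add(object_id)
    return list(classes.values())


def approximations(table, attributes, target):
    """Return (lower, upper, boundary) of the object set `target`."""
    lower, upper = set(), set()
    for eq_class in equivalence_classes(table, attributes):
        if eq_class <= target:   # class lies entirely inside the target set
            lower |= eq_class
        if eq_class & target:    # class overlaps the target set
            upper |= eq_class
    return lower, upper, upper - lower
```

Here `table` is assumed to be a dictionary mapping an object identifier to a dictionary of attribute values, and `target` is a set of object identifiers, for instance all patients with a given overall grade.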


5.3  Rough Set Processing of Electrostimulation Data

The results of electrostimulation tests are collected in forms, separately for each
patient. They are then transformed into the decision tables used in rough set
decision systems (Tab. 1). Objects $t_1$ to $t_n$ represent patient cases. Attributes
$A_1$ to $A_m$ denote the tested parameters introduced in Par. 4 (a-s) and are
used as conditional attributes. The data values $a_{11}$ to $a_{nm}$ are numbers
or grades (quantized values). The decision D is understood as a value assigned to the
overall grade (OVERALL GRADE). This quantity represents the expected
post-operative profits expressed in the descriptive scale as follows:

OVERALL GRADE = 1 - meaning: predicted hearing recovery profits - none
OVERALL GRADE = 2 - meaning: predicted hearing recovery profits - low
OVERALL GRADE = 3 - meaning: predicted hearing recovery profits - fair
OVERALL GRADE = 4 - meaning: predicted hearing recovery profits - well
OVERALL GRADE = 5 - meaning: predicted hearing recovery profits - very good


Table  1.  Decision  table  used  in  electrostimulation  database 

The engineered decision system employs both learning and testing algorithms
[15]. During the first phase, rules are derived that form the basis for the second
phase. The generation of decision rules starts from rules of length 1,
then the system generates rules of length 2, etc. The maximum rule
length may be defined by the operator. The system induces both certain and possible
rules. It is assumed that the rough set measure ($\mu_{RS}$) for possible rules should exceed
the value 0.5. Moreover, only such rules are taken into account that have been
preceded by any shorter rule operating on the same parameters. The system produces
rules of the following form:

$$(\mathit{attribute}_1 = \mathit{value}\_a_{i1}) \text{ and } \ldots \text{ and } (\mathit{attribute}_m = \mathit{value}\_a_{im}) \Rightarrow (\mathit{OverallGrade} = d_i)$$
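
The rule generation scheme described above (rules of length 1, then 2, and so on, keeping certain rules and possible rules with a measure above 0.5) can be sketched as follows. The sketch assumes that the rough set measure of a rule is the fraction of records matching the conditions that also carry the rule's decision; the paper does not define the measure in this section, so this definition, the function name and the data layout are assumptions. The additional filter requiring a preceding shorter rule is omitted.

```python
from itertools import combinations


def induce_rules(records, condition_names, decision_name, max_length=2, threshold=0.5):
    """Return (conditions, decision, measure) triples sorted by decreasing measure."""
    rules = []
    for length in range(1, max_length + 1):
        for names in combinations(condition_names, length):
            # Group the decisions of all records sharing the same condition values.
            groups = {}
            for record in records:
                key = tuple(record[name] for name in names)
                groups.setdefault(key, []).append(record[decision_name])
            for key, decisions in groups.items():
                for decision in set(decisions):
                    measure = decisions.count(decision) / len(decisions)
                    if measure > threshold:  # measure == 1 gives a certain rule
                        rules.append((dict(zip(names, key)), decision, measure))
    rules.sort(key=lambda rule: -rule[2])
    return rules
```

With a threshold of 0.5 at most one decision per group of condition values can qualify, so conflicting rules with identical conditions are never produced.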

The data were gathered from all subjects during the interviews and
electrostimulation test sessions. Some exemplary data records are presented in Tab. 2. Once the
results of several patients are available, these data are processed by the rough set algorithm.

Table 2. Fragments of electrostimulation database records (described in Par. 4)

For the discussed example (Tab. 2) the following strongest rules were obtained,
which are in good accordance with the principles and practice of audiology:


IF (g = high) THEN (OverallGrade = very good)  $\mu_{RS}$ = 1

IF (d = high) & (s = low) THEN (OverallGrade = low)  $\mu_{RS}$ = 0.?2

IF (c = 1) & (g = low) & (s = low) THEN (OverallGrade = none)  $\mu_{RS}$ = 0.8

IF (b = trauma) & (c = 3) THEN (OverallGrade = well)  $\mu_{RS}$ = 0.76

IF (f = high) & (g = high) THEN (OverallGrade = well)  $\mu_{RS}$ = 0.72


Every new patient record can be tested using previously induced rules, and on this
basis a predictive diagnosis of post-operative profits can be automatically provided by
the system. This diagnosis, expressed as a grade value, may be used as a supportive or
a contradictory factor in the process of qualification of the deaf patient for cochlear
implantation. The accuracy of decisions produced by the intelligent database analysis
algorithm is expected to grow as the number of patient records increases.
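
Testing a new record against induced rules can be sketched as below. The rule list contains the four rules from the example above whose measures are fully legible in the source. Selecting the matching rule with the highest measure as the winning rule is an assumption about how the system resolves competing rules, and the record layout and function name are hypothetical.

```python
# (conditions, predicted overall grade, rough set measure)
RULES = [
    ({"g": "high"}, "very good", 1.0),
    ({"c": 1, "g": "low", "s": "low"}, "none", 0.8),
    ({"b": "trauma", "c": 3}, "well", 0.76),
    ({"f": "high", "g": "high"}, "well", 0.72),
]


def winning_rule(record, rules=RULES):
    """Return the matching rule with the highest measure, or None if none matches."""
    matching = [
        rule for rule in rules
        if all(record.get(name) == value for name, value in rule[0].items())
    ]
    if not matching:
        return None
    return max(matching, key=lambda rule: rule[2])
```

A record for which no rule fires yields `None`, which a practical system would have to report as "no prediction" rather than silently assigning a grade.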


6  Conclusions 

In this paper a new procedure for testing candidates for hearing implantation has
been presented. The frequency range obtained during the electrostimulation tests
with the modified ball-shaped electrode makes it possible to deliver not only tones to the
auditory nerve; with the use of the signal processing device, a speech signal can also be
received by some completely deaf patients. This may be helpful in properly diagnosing and
qualifying patients for cochlear implantation and in giving the patients some kind of sound
experience before the implantation.

The engineered RSC Library procedures, offering a symbolic approach to data
processing, have been implemented in the constructed electrostimulation test database.
They allow data records to be analyzed automatically in order to mine rules showing some
hidden dependencies between patients' data and the expected hearing recovery after
cochlear implantation.

The proposed method of prediction of post-operative results is presently at the
experimental stage and requires some more testing in order to reveal its full potential.
Nevertheless, a diagnosis provided by the algorithm may already be used as a
supportive or a contradictory factor in the process of qualification of deaf patients for
cochlear implantation.


Abstract. A non-trivial obstacle in good text classification for information
filtering and retrieval (IF/IR) is the dimensionality of the data.
This paper proposes a technique using Rough Set Theory to alleviate
this situation. Given corpora of documents and a training set of
examples of classified documents, the technique locates a minimal set of
co-ordinate keywords to distinguish between classes of documents,
reducing the dimensionality of the keyword vectors. This simplifies the
creation of knowledge-based IF/IR systems, speeds up their operation, and
allows easy editing of the rule bases employed. The paper describes the
proposed technique, discusses the integration of a keyword acquisition
algorithm with a rough set-based dimensionality reduction algorithm,
and provides experimental results of a proof-of-concept implementation.


1  Introduction 

Information Filtering (IF) and Information Retrieval (IR) are rapidly acquiring
importance as the volume of electronically stored information explodes. Text
classification  is  an  important  part  of  information  filtering  in  that  it  categorises 
documents  within  text  corpora.  The  user  may  then  handle  the  various  classes 
of  documents  in  different  ways  and  devote  attention  only  to  those  classes  that 
merit  it.  For  instance,  an  E-mail  classification  system  could  divide  incoming  mail 
into  business-related  messages,  personal  messages  and  useless,  unsolicited  mail 
to  be  deleted  automatically. 

However, a non-trivial obstacle in good text classification is the
dimensionality of the data. In most IF/IR techniques, each document is described by a
vector of extremely high dimensionality, typically one value per word or pair
of words in the document [1]. The vector ordinates are used as preconditions to a
rule which decides what class the document belongs to. Document vectors
commonly comprise tens of thousands of dimensions [2], which renders the problem
all but intractable for even the most powerful computers. The use of the cosine
of the angle between two vectors [3] as a comparison metric further increases the
number of operations to be performed for the classification of one document.
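
The cost of the cosine metric mentioned above is easy to see from a direct implementation: one dot product and two norms, each touching every dimension of the vectors. The sketch below uses plain Python lists as document vectors, which is an assumption made for illustration; the function name is hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equally long document vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimensionality")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for empty documents
    return dot / (norm_u * norm_v)
```

Each comparison costs a number of multiplications proportional to the number of dimensions, so reducing the keyword set directly reduces the work per classified document.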

This paper proposes a technique using Rough Set Theory [4] that can help
cope with this situation. Given corpora of documents and a set of examples
of classified documents, the technique can quickly locate a minimal set of
co-ordinate keywords to distinguish between classes of documents. As a result, it
dramatically reduces the dimensionality of the keyword space. The resulting set
of keywords (or preconditions) is typically small enough to be understood by a
human. This simplifies the creation of knowledge-based IF/IR systems, allowing
easy editing of the rule bases.

The background of text classification and Rough Set theory is discussed in
Sect. 2. Section 3 provides a description of the proposed system. Section 4
describes experimental results. The paper is concluded in Sect. 5.


2  Background 

2.1  The  Rough  Set  Attribute  Reduction  Method  (RSAR) 

Rough set theory [4] is a flexible, formal mathematical tool that can be applied to
reducing the dimensionality of datasets. RSAR removes redundant input attributes
from datasets of nominal values, while ensuring that no information essential
for the task at hand is lost. The approach is very efficient, taking advantage of
conventional Set Theory operations. It works by maximising a quantity known
as degree of dependency. The degree of dependency $\gamma_X(Y)$ of a set $Y$ of decision
attributes on a set of conditional attributes $X$ provides a measure of how important
that set of conditional attributes is in classifying the dataset examples into the
classes in $Y$. If $\gamma_X(Y) = 0$, then classification $Y$ is independent of the attributes
in $X$, hence the conditional attributes are of no use to this classification. If $\gamma_X(Y) = 1$,
then $Y$ is completely dependent on $X$, hence the attributes are indispensable.
Values $0 < \gamma_X(Y) < 1$ denote partial dependency, which shows that only some
of the attributes in $X$ may be useful, or that the dataset was flawed to begin
with.

To calculate $\gamma_X(Y)$, it is necessary to define the indiscernibility relation.
Given a subset of the set of attributes, $P \subseteq A$, two objects $x$ and $y$ in a dataset
$U$ are indiscernible with respect to $P$ if and only if $f(x, q) = f(y, q) \;\; \forall q \in P$
(where $f(a, B)$ is the classification function represented in the dataset, returning
the classification of object $a$ using the conditional attributes contained in the set
$B$). The indiscernibility relation for all $P \subseteq A$ is written as $\mathrm{IND}(P)$. $U/\mathrm{IND}(P)$
is used to denote the partition of $U$ given $\mathrm{IND}(P)$:

$$U/\mathrm{IND}(P) = \otimes \{ q \in P : U/\mathrm{IND}(\{q\}) \} \qquad (1)$$

where the operator $\otimes$, as applied to two sets of sets $A$ and $B$, is defined as:

$$A \otimes B = \{ X \cap Y : \forall X \in A, \forall Y \in B, X \cap Y \neq \emptyset \} \qquad (2)$$
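
Equations (1) and (2) translate almost literally into code. The sketch below assumes the dataset is a dictionary mapping an object identifier to a dictionary of attribute values; this layout and the function names are assumptions made for illustration, not part of the described system.

```python
from functools import reduce


def otimes(a, b):
    """A (x) B of Eq. (2): all non-empty intersections of a block of a with a block of b."""
    return {x & y for x in a for y in b if x & y}


def partition(dataset, attributes):
    """U/IND(P) of Eq. (1): combine the single-attribute partitions with otimes."""
    single_partitions = []
    for q in attributes:
        blocks = {}
        for obj, row in dataset.items():
            blocks.setdefault(row[q], set()).add(obj)
        single_partitions.append({frozenset(block) for block in blocks.values()})
    # Start from the trivial partition {U}; it is the neutral element of otimes.
    universe = {frozenset(dataset)}
    return reduce(otimes, single_partitions, universe)
```

Frozen sets are used so that blocks can themselves be members of a set, which mirrors the "set of sets" wording of the definition.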

Rough Sets approximate traditional sets using a pair of other sets, named the
lower and upper approximation of the set in question. The lower and upper
approximations of a set $Y \subseteq U$, given an equivalence relation $\mathrm{IND}(P)$, are defined
as:

$$\underline{P}Y = \bigcup \{ X : X \in U/\mathrm{IND}(P), X \subseteq Y \} \qquad (3)$$

$$\overline{P}Y = \bigcup \{ X : X \in U/\mathrm{IND}(P), X \cap Y \neq \emptyset \} \qquad (4)$$

Suppose that $P$ and $Q$ are equivalence relations in $U$; the positive region $POS_P(Q)$
contains all objects in $U$ that can be classified in attributes $Q$ using the
information in attributes $P$ and is defined as:

$$POS_P(Q) = \bigcup_{X \in Q} \underline{P}X \tag{5}$$

From this, the degree of dependency $\gamma_P(Q)$ is given by:

$$\gamma_P(Q) = \frac{\| POS_P(Q) \|}{\| U \|} \tag{6}$$

where $\| Set \|$ is the cardinality of $Set$.
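
The definitions (1) to (6) can be traced on a toy dataset. The following is a minimal sketch, assuming the dataset is held as a dictionary that maps each object to its attribute values and that the classes $X$ in (5) are the blocks of the partition induced by the decision attributes; the function names are illustrative and do not come from the paper.

```python
from collections import defaultdict


def partition(universe, table, attrs):
    """U/IND(attrs): group objects that agree on every attribute in attrs."""
    blocks = defaultdict(set)
    for obj in universe:
        key = tuple(table[obj][a] for a in attrs)
        blocks[key].add(obj)
    return list(blocks.values())


def lower_approximation(universe, table, attrs, target):
    """Union of the equivalence classes of IND(attrs) wholly inside target."""
    result = set()
    for block in partition(universe, table, attrs):
        if block <= target:
            result |= block
    return result


def positive_region(universe, table, cond, dec):
    """Objects that the attributes in cond classify unambiguously into dec."""
    pos = set()
    for dec_class in partition(universe, table, dec):
        pos |= lower_approximation(universe, table, cond, dec_class)
    return pos


def dependency(universe, table, cond, dec):
    """Degree of dependency: |POS_cond(dec)| / |U|."""
    return len(positive_region(universe, table, cond, dec)) / len(universe)


# Toy dataset: conditional attributes "a" and "b", decision attribute "d".
TABLE = {
    0: {"a": 0, "b": 0, "d": "x"},
    1: {"a": 0, "b": 1, "d": "y"},
    2: {"a": 1, "b": 0, "d": "x"},
    3: {"a": 1, "b": 1, "d": "y"},
    4: {"a": 1, "b": 1, "d": "x"},
}
UNIVERSE = set(TABLE)
```

In this toy table, objects 3 and 4 agree on both conditional attributes but differ in class, so even the full attribute set yields only a partial dependency.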

The naive version of the RSAR algorithm evaluates $\gamma_P(Q)$ for all possible
subsets of the dataset's conditional attributes, stopping when it either reaches 1,
or there are no more combinations to investigate. The QuickReduct Algorithm
(described in [5, 6]) escapes the NP-hard nature of the naive version by searching
the tree of attribute combinations in a depth-first manner. It starts off with an
empty subset and adds attributes one by one, each time selecting the attribute
whose addition to the current subset will offer the highest increase of $\gamma_P(Q)$.
The algorithm stops when a $\gamma_P(Q)$ of 1 is reached, or when all attributes have
been added (in which case the dataset could not be correctly classified to begin
with).
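
The greedy search described above may be sketched as follows. This is an illustrative reading of the description, not the implementation of [5, 6]: it assumes a single decision attribute, rows stored as dictionaries, and ties broken in favour of the leftmost attribute; the names `gamma` and `quickreduct` are hypothetical.

```python
from collections import defaultdict


def gamma(rows, cond, dec):
    """Degree of dependency of the decision attribute dec on the attributes in cond."""
    classes_seen = defaultdict(set)
    counts = defaultdict(int)
    for row in rows:
        key = tuple(row[a] for a in cond)
        classes_seen[key].add(row[dec])
        counts[key] += 1
    # An equivalence class lies in the positive region if it maps to one class only.
    consistent = sum(n for key, n in counts.items() if len(classes_seen[key]) == 1)
    return consistent / len(rows)


def quickreduct(rows, conditional, dec):
    """Add attributes one at a time, always taking the largest gain in gamma."""
    reduct = []
    best = gamma(rows, reduct, dec)
    remaining = list(conditional)
    while best < 1.0 and remaining:
        scores = [(gamma(rows, reduct + [a], dec), a) for a in remaining]
        top_score = max(score for score, _ in scores)
        # Ties go to the leftmost attribute (assumption), so column order matters.
        chosen = next(a for score, a in scores if score == top_score)
        reduct.append(chosen)
        remaining.remove(chosen)
        best = top_score
    return reduct, best


ROWS = [
    {"a": 0, "b": 0, "c": 0, "d": "x"},
    {"a": 0, "b": 1, "c": 0, "d": "y"},
    {"a": 1, "b": 0, "c": 1, "d": "x"},
    {"a": 1, "b": 1, "c": 1, "d": "y"},
]
```

Here attribute `b` alone determines the class, so the search stops after a single step.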

It is evident from this that the RSAR will not compromise with a set of
conditional attributes that contains a large part of the information of the initial
set; it will always attempt to reduce the attribute set without losing any
information significant to the classification at hand.

Since the QuickReduct algorithm builds conditional attribute subsets
incrementally, it is possible to influence its decisions by suitable re-ordering of the
conditional attributes in the dataset such that attributes suspected to be of more
importance are placed before (to the left of) attributes of lesser importance. This
is done by the integrated system described in this paper, in order to suggest to
RSAR to utilise the better, more characteristic keywords before others.


2.2  Text  Classification 

Text classification aims to separate groups (corpora) of documents into
categories. Like all classification tasks, it may be tackled either by comparing new
documents with previously classified ones (distance-based techniques), or by using
rule-based approaches.

Most text classification techniques that do not operate at the semantics level
work in a multidimensional space whose axes represent the presence of different
keywords. Depending on the specific technique, the axes of this keyword space may be
discrete (e.g. boolean) or continuous (e.g. frequency of keywords, importance of
keyword, et cetera). The dimensionality of the keyword space depends on the
cardinality of the universal set of keywords, defined as the union of all possible
keywords of all documents examined, as shown in (7). Any document $D_j$ (a set of
keywords) in the corpora at hand can then be represented by a multidimensional
keyword vector $\mathbf{v}_j$, as shown in (8):

$$U = \bigcup_i D_i = \{k_1, k_2, \ldots, k_n\} \tag{7}$$

$$\mathbf{v}_j = (f(D_j, k_1), f(D_j, k_2), \ldots, f(D_j, k_n)) \tag{8}$$

where each coordinate $f(D_j, k_i)$ ($1 \le i \le n$) in the vector represents the weight of
the keyword $k_i$ in $U$. The weight is the result of some metric $f(D_j, k_i)$, measuring
the presence, frequency, importance or other quantifiable aspect of the keyword
$k_i$ in the document.

The  two  most  commonly  used  text  classification  approaches  are  outlined 
below. 

Distance-Based Text Classification. Distance-based classification involves
the comparison of high-dimensionality keyword vectors. In cases where the vector
describes groups of documents, it identifies the centre of a cluster of documents.
Documents are classified by comparing their document vectors. The metric most
commonly used is the cosine of the angle between the two vectors, derived in
terms of the scalar or inner product, though other metrics are also available [1].
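
As an illustration of the cosine metric, the sketch below expands two documents into vectors over a shared keyword list and compares them. It assumes each document is given as a dictionary of keyword weights, with absent keywords weighted zero; the function names are illustrative only.

```python
import math


def keyword_vector(doc_weights, universe):
    """Expand a {keyword: weight} mapping into a vector over the keyword universe."""
    return [doc_weights.get(k, 0.0) for k in universe]


def cosine_similarity(u, v):
    """Cosine of the angle between u and v, computed from the inner product."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for an empty document
    return dot / (norm_u * norm_v)
```

A new document would then be assigned to the class whose cluster centre gives the largest cosine value.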

The set of keywords representing one document (or a cluster of documents)
may be obtained by numerous different algorithms that scan corpora of
documents for keywords, ranking them by perceived importance. Weight calculation
metrics range from a simple frequency-proportional weighting technique that
naively attaches more importance to the common words in a document, to the
inverse document frequency that emphasises those keywords that are common in
the document in question, yet uncommon in the overall collection of documents.

Datasets for such systems are almost always built automatically and
maintained using paradigms like learning by observation, example or imitation that
abstract away the actual calculation of weights and formation of vectors. This
allows the user to obtain complex intelligent-like text classification behaviour
with a minimum of effort. Unfortunately, the dimensionality of the document
vectors is typically extremely high (usually in the tens of thousands), a detail
that greatly slows down classification tasks and makes storage of document
vectors expensive [2].

Rule-Based Text Classification. Rule-based text classification has been in use
for a relatively long time and is an established method of classifying documents.
Common applications include the kill-file article filters used by Usenet client
software and van den Berg's autonomous E-mail filter, Procmail.

In  this  context,  keyword  vectors  are  viewed  as  rule  preconditions;  the  class  a 
document  belongs  to  is  used  as  the  rule  decision: 


$$k_1, k_2, \ldots, k_n \in U$$

$$r_i(D) = p(D, k_1) \wedge p(D, k_2) \wedge \ldots \wedge p(D, k_n) \Rightarrow D \in P \tag{9}$$

where $k_i$ are document keywords, $U$ is the universal keyword set, $D$ is one
document, $P$ is a document class, $r_i(D)$ is rule $i$ applied to $D$ and $p(D, k_i)$
is a function evaluating to true if $D$ contains keyword $k_i$ such that it satisfies
some metric (e.g. a minimum frequency or weight). Not all keywords in the
universal set need be checked for. This allows rule-based text classifiers to exhibit
a notation much terser than that of vector-based classifiers, where a vector must
always have the same dimensionality as the keyword space.

In most cases, rules are written by the human user. Most typical rule bases
simply test for the presence of specific keywords in the document (so $p(D, k_i)
\equiv k_i \in D$). For example, a Usenet client may use a kill-file to filter out newsgroup
articles by some specific person by looking for the person's name in the article's
'From' field. Such rule-based approaches are inherently simple to understand,
which accounts for their popularity among end-users. Unfortunately, complex
needs often result in very complex rule bases, ones that users have difficulty
maintaining by hand.
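
A rule of the form (9) with the simple presence test can be sketched in a few lines. The sketch assumes a document is a set of keywords and that the first matching rule decides the class; `make_rule` and `classify` are hypothetical names, and the field-prefixed keywords are invented for illustration.

```python
def contains(document, keyword):
    """p(D, k): true if keyword k occurs in document D (presence test only)."""
    return keyword in document


def make_rule(keywords, doc_class):
    """Build a rule that fires when every listed keyword is present."""
    def rule(document):
        if all(contains(document, k) for k in keywords):
            return doc_class
        return None
    return rule


def classify(document, rules, default=None):
    """Return the class of the first rule that fires, or default if none does."""
    for rule in rules:
        result = rule(document)
        if result is not None:
            return result
    return default


RULES = [
    make_rule({"from:alice"}, "kill"),
    make_rule({"subject:rough", "body:sets"}, "research"),
]
```

Each rule names only the keywords it tests, which is the terseness noted above: no rule has to mention the rest of the universal keyword set.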

3  The  Proposed  System 

This paper proposes a system that builds text classification rule bases, although
it is trivial to adapt it to distance-based approaches by using a different keyword
acquisition sub-system (as described in sect. 3.1). The modularity of the
proposed technique allows this.

The application domain chosen to test the system is E-mail, since real-life
corpora of E-mail messages are very easy to obtain. Most users of E-mail keep
'folders' of messages related in some way. This provides training data for the
system. Like many documents, E-mail messages are structured: each message
comprises a header, itself comprising a number of fields, and a body. Keywords
may need to be treated differently, depending on where in the message they
are encountered. For example, the Message-ID field is a unique identifier of an
E-mail message and, as such, a notorious opportunity for over-training. This
rigidly structured nature of E-mail messages makes the domain attractive as a
test-case for a text classification system.

The system comprises two main stages, as shown in fig. 1: the keyword
acquisition stage reads corpora of documents (folders of similar E-mail messages),
locates candidate keywords, estimates their importance and builds an
intermediate dataset of high dimensionality. The RSAR part examines the dataset and
removes redundancy, insofar as this is possible, leaving a dataset or rule base
containing a drastically reduced number of preconditions per rule.

Fig. 1. Data flow through the system.


3.1  Keyword  Acquisition 

This sub-system uses a set of E-mail folders as input. Each folder is expected
to contain similar E-mail messages. Folders (in UNIX mailbox format) contain
standard E-mail messages as described in the RFC-822 document [7]. Messages
are sequentially read; each field (key/value pair) within each message is treated
separately. The body of a message is treated as a very long field.

Within  each  field,  words  are  isolated  and  pre-filtered  to  avoid  very  short  or 
long  keywords,  or  keywords  that  are  not  words  (e.g.  long  numbers  or  random 
sequences  of  characters).  Every  word  or  pair  of  consecutive  words  in  the  text  is 
considered  a  candidate  keyword.  The  name  of  the  current  field  is  also  added  to 
the  keyword,  so  that  information  on  where  in  the  message  the  keyword  occurred 
is  retained. 
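
The candidate extraction step may be illustrated as below. The length limits, the alphabetic test and the `field:keyword` labelling are assumptions made for this sketch; the paper does not state the actual thresholds or encoding. Pairs are formed only from words that are adjacent in the text and both pass the filter.

```python
MIN_LEN, MAX_LEN = 3, 20  # hypothetical length limits, not taken from the paper
PUNCTUATION = ".,;:!?()\"'"


def is_word(token):
    """Reject very short or very long tokens and tokens that are not alphabetic."""
    return token.isalpha() and MIN_LEN <= len(token) <= MAX_LEN


def candidate_keywords(field_name, text):
    """Return single words and adjacent word pairs, tagged with the field name."""
    tokens = [t.strip(PUNCTUATION).lower() for t in text.split()]
    field = field_name.lower()
    candidates = []
    for i, token in enumerate(tokens):
        if not is_word(token):
            continue
        candidates.append(f"{field}:{token}")
        # A pair is kept only if the next token in the text is also a valid word.
        if i + 1 < len(tokens) and is_word(tokens[i + 1]):
            candidates.append(f"{field}:{token} {tokens[i + 1]}")
    return candidates
```

Because `str.isalpha` accepts any alphabetic Unicode character, the filter does not depend on an English word list.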

At  this  stage,  the  keyword  acquisition  sub-system  has  two  modes  of  operation: 
one,  dubbed  single  cluster  mode,  generates  a  set  of  keywords  characterising  each 
entire  folder;  the  other  generates  separate  sets  of  keywords  for  each  message  in 
each  folder,  ultimately  used  to  create  one  classification  rule  per  message,  hence 
dubbed  one-per-message.  In  the  former  case,  shown  in  (10),  keyword  weights 
are  calculated  such  that  keywords  common  to  most  messages  are  deemed  more 
important.  In  the  latter  case  (11),  the  weighting  metric  emphasises  keywords 
that  show  the  message’s  difference  from  other  messages  in  the  collection.  This 
applies  pressure  to  diversify  the  keyword  sets,  rather  than  create  multiple  copies 
of  the  same  set  of  keywords  for  each  message.  The  weighting  functions  are  as 
follows: 


$$W_1(k) = -\log\left(\ldots\right) f_k w_f \tag{10}$$

$$W_2(k) = \log\left(\ldots\right) f_k w_f \tag{11}$$

where $W_1(k)$ and $W_2(k)$ are the weights of keyword $k$ on a per-folder and
per-message basis respectively; $N_k$ is the number of messages in the folder containing
$k$; $N$ is the total number of messages in the folder; $f_k$ is the frequency of the
keyword $k$ in the current message; and $w_f$ denotes the current field's importance
to the classification, which depends on the application and user preferences.
To  avoid  over-training  and  other  pitfalls,  certain  fields  are  marked  as  far  less 
important  than  others  and  this  influences  the  weights  of  keywords  within  them. 
For  instance,  it  is  relatively  safe  to  expect  the  subject  and  body  of  a  message  to 
contain  very  useful  information,  whereas  a  trace  of  the  message’s  delivery  path 
(the  Received  field)  is  unlikely  to  provide  useful  keywords. 

Before the weighted keyword is added to the set of keywords, it passes
through two filters: one is a low-pass filter removing words so uncommon that they
are definitely not good keywords; the other is a high-pass filter that removes far too
common words such as auxiliary verbs, articles et cetera. This gives the added
advantage of language-independence to the keyword acquisition algorithm: most
similar methods rely on English thesauri and lists of common English words to
perform the same function. Finally, all weights are normalised before the keyword
sets are output. This allows for more homogeneous handling of weights in the
next stages and avoids counter-intuitive results as identified in [1].
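
The two frequency filters and the normalisation can be pictured with the sketch below. It assumes that commonness is measured as the share of messages containing a keyword, that the cut-off shares are free parameters, and that normalisation divides by the largest surviving weight; none of these details are given in the paper, and the parameter names are hypothetical.

```python
def filter_and_normalise(weights, doc_freq, n_docs, min_share=0.05, max_share=0.9):
    """Drop keywords that are too rare or too common, then scale weights to [0, 1].

    weights  -- {keyword: raw weight}
    doc_freq -- {keyword: number of messages containing the keyword}
    n_docs   -- total number of messages
    """
    kept = {
        k: w
        for k, w in weights.items()
        if min_share <= doc_freq[k] / n_docs <= max_share
    }
    if not kept:
        return {}
    top = max(kept.values())
    if top <= 0:
        return {k: 0.0 for k in kept}
    return {k: w / top for k, w in kept.items()}
```

Since both filters look only at counts, no stop-word list for any particular language is needed.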

It must be emphasised that any keyword acquisition algorithm may be
substituted for the one described above, as long as it outputs weighted keywords.


3.2  Rough  Set  Attribute  Reduction 

The  RSAR  used  here  is  exactly  as  described  previously.  It  reads  the  sets  of 
keywords  generated  by  the  keyword  acquisition  algorithm  above.  A  dataset 
is  constructed  by  evaluating  the  union  of  all  sets  of  weighted  keywords;  the 
keywords  are  sorted  in  order  of  decreasing  weight.  Where  one  keyword  has  two 
or  more  different  weights,  the  maximum  one  is  used.  Each  keyword  maps  to  one 
conditional  attribute  in  the  dataset.  The  decision  attribute  is  the  name  of  the 
folder  from  where  the  keyword  set  was  extracted.  Missing  values  in  the  dataset 
denote  the  absence  of  that  particular  keyword  in  the  particular  keyword  set. 

For example, the two sets of keywords below, describing two documents $D_1$
and $D_2$, may be used to build the dataset shown in table 1.

$$D_1 = \{(k_1, 0.19), (k_2, 0.98), (k_3, 0.72), (k_4, 0.87)\} \tag{12}$$

$$D_2 = \{(k_4, 0.31), (k_5, 0.42), (k_6, 0.56)\} \tag{13}$$

Table 1. Dataset produced from (12) and (13), assuming they belong to folders $\alpha$ and
$\beta$ respectively.

        k2      k4      k3      k6      k5      k1      Class
D1      0.98    0.87    0.72    -       -       0.19    $\alpha$
D2      -       0.31    -       0.56    0.42    -       $\beta$

Since RSAR is better suited to nominal datasets, the dataset is thus
quantised. Two different quantisation methods are available: a boolean quantisation,
where a value of 1 signifies the keyword's presence and a value of 0 signifies its
absence; and a quantisation of the normalised weight space into eleven values,
calculated by $\lfloor 10w \rfloor$ (where $w$ is a keyword weight and $\lfloor \cdot \rfloor$ is the floor function,
evaluating to the greatest integer less than or equal to its argument). The two
methods are there to allow better interfacing with various classifiers as well as
to evaluate the best technique for quantising weights in the application domain
at hand.
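
The construction of the intermediate dataset from (12) and (13), followed by both quantisation methods, may be sketched as follows. The folder names, the use of `None` for a missing value and the mapping of missing values to 0 under the weighted quantisation are assumptions of this sketch.

```python
import math


def build_dataset(keyword_sets):
    """keyword_sets: list of (name, folder, {keyword: weight}) triples.

    Columns are the union of all keywords, sorted by decreasing weight; a keyword
    with several weights is ranked by its maximum. Absent keywords become None.
    """
    best = {}
    for _, _, weights in keyword_sets:
        for k, w in weights.items():
            best[k] = max(w, best.get(k, w))
    columns = sorted(best, key=lambda k: -best[k])
    rows = [
        (name, [weights.get(k) for k in columns], folder)
        for name, folder, weights in keyword_sets
    ]
    return columns, rows


def quantise_boolean(value):
    """1 if the keyword is present, 0 if it is absent."""
    return 0 if value is None else 1


def quantise_weight(value):
    """floor(10 w), giving one of eleven values 0..10 for w in [0, 1]."""
    return 0 if value is None else math.floor(10 * value)


D1 = {"k1": 0.19, "k2": 0.98, "k3": 0.72, "k4": 0.87}
D2 = {"k4": 0.31, "k5": 0.42, "k6": 0.56}
COLUMNS, ROWS = build_dataset([("D1", "alpha", D1), ("D2", "beta", D2)])
```

The resulting column order is k2, k4, k3, k6, k5, k1, matching the header of table 1.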

Having quantised the intermediate, high-dimensionality dataset, the RSAR
can now execute the QuickReduct algorithm to remove all redundant conditional
attributes. This results in a dataset of radically reduced dimensionality. Since
each object in the dataset comprises a set of conditional attributes accompanied
by a decision attribute (the document class the object belongs to), it can be
viewed as a set of production rules, with conditional attribute values being the
rule preconditions and the document class being the rule's decision. The dataset
is simply post-processed to remove duplicate rules and output in the form of a
rule base.
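
The post-processing step can be illustrated with a short sketch: each row of the reduced dataset is projected onto the retained attributes, and repeated rules are dropped. The row layout and the name `rule_base` are assumptions for illustration.

```python
def rule_base(rows, reduct, decision):
    """Project rows onto the reduct and keep each distinct rule once, in order."""
    seen = set()
    rules = []
    for row in rows:
        preconditions = tuple((a, row[a]) for a in reduct)
        rule = (preconditions, row[decision])
        if rule not in seen:
            seen.add(rule)
            rules.append(rule)
    return rules


REDUCED = [
    {"k2": 1, "k4": 0, "class": "alpha"},
    {"k2": 1, "k4": 1, "class": "alpha"},
    {"k2": 0, "k4": 1, "class": "beta"},
]
```

With only `k2` retained, the first two rows collapse into a single rule.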


4  Experimental  Results 

The  use  of  the  above  proposed  system  allows  the  generation  of  rule  bases  of 
extremely  reduced  dimensionality  from  corpora  of  documents,  without  reducing 
the  classification  accuracy,  inasmuch  as  this  is  possible.  To  demonstrate  this, 
seven  different  corpora  of  E-mail  messages  were  used.  The  corpora  were  chosen 
so  as  to  provide  a  wide  spectrum  of  characteristics  found  in  sets  of  documents: 
some  are  homogeneous,  containing  messages  by  a  single  author;  others  contain 
text  from  multiple  authors  with  different  writing  styles.  Small  and  larger  corpora 
are  mixed  to  ensure  that  the  size  of  the  document  collection  does  not  influence 
the  quality  of  the  resultant  rule  base.  Each  collection  of  messages  represents  one 
class  of  documents. 

Random groups of two to five such message folders are chosen and fed to
the system. All combinations of keyword generation (single cluster or
one-per-message) and quantisation method (boolean, weight integer part) are
investigated. Table 2 shows the average dimensionality reduction for each combination.
Dimensionality reduction is shown in orders of magnitude (base 10). Note that,
although the one-per-message dataset generation technique shows greater
reduction, its pre-reduction dimensionality is typically much higher than for the
single cluster method. Average pre-reduction dimensionality is 26,827 for the
one-per-message technique and 338 for the single-cluster technique. The resultant
rule bases have rules involving one to six preconditions, with the boolean rule
bases exhibiting slightly higher dimensionality due to boolean conditional
attributes' lesser information content.

Figure 2 shows the average discernibility in the generated rule bases. A
discernibility of 1 indicates no loss of information after the RSAR algorithm has
executed. A classifier employing the rule base in question will, assuming the
training set is a good statistical sample, achieve a classification accuracy very
similar to the discernibility of the rule base. As shown in the figure, generating
one rule for an entire folder of E-mail messages does not allow for good
classification. Discernibility can drop to unacceptable levels and varies widely depending
on the content of the text corpora. By contrast, generating multiple rules allows
for much better coverage of the feature space: discernibility is high enough to
be satisfactory and appears to be almost constant. Binary quantisation seems to
offer slightly less satisfactory results than the weighted method.

Table 2. Average dimensionality reduction in orders of magnitude for all four
combinations of operation modes.

            Single cluster    One-per-message
Boolean     2.02              3.62
Weighted    2.29              3.70

In  terms  of  the  linguistic  nature  of  the  rule  bases,  it  is  considerably  easier 
for  a  human  to  read  and  understand  rules  that  span  entire  corpora,  rather  than 
rules  that  describe  single  documents.  It  is  also  possible  to  judge  the  quality  of 
the  rule  base  in  an  intuitive  manner,  simply  by  reading  it. 


5  Conclusion 

Text classification relies heavily on the acquisition of sets of co-ordinate keywords
describing documents. This paper has summarised a technique that reduces the
need to choose a restricted number of good quality keywords by allowing larger
collections of keywords to be built. The dimensionality of these sets of keywords
is then reduced with the aid of Rough Set Theory, while maintaining intact the
information contained in the keyword sets. The technique is efficient,
language-independent and particularly flexible due to its modular nature. The decrease
in dimensionality is drastic, due to RSAR's reduction of the keyword set to the
minimum required for the classification at hand.

The  system  can  be  interfaced  to  a  number  of  text  classifiers  to  produce  highly 
optimised  rule-based  text  classification  applications,  while  still  allowing  the  user 
to  read  and  intuitively  understand  the  rule  bases. 

The approach is still in its early stages of research, which accounts for the
less-than-acceptable results of the single-cluster rule generation method.
Further investigation into suitable keyword weighting metrics and rule induction
especially designed for text classification is in progress, as is an actual test-bed
application of the present technique to the classification of in-coming E-mail
messages.

Abstract. This article describes a way of designing a hybrid system for
classification and rule generation, in the soft computing paradigm,
integrating rough set theory with a fuzzy MLP using an evolutionary algorithm.
An $l$-class classification problem is split into $l$ two-class problems. Crude
subnetworks are initially obtained for each of these two-class problems
via rough set theory. These subnetworks are then combined and the final
network is evolved using a GA with restricted mutation operator which
utilizes the knowledge of the modular structure already generated, for
faster convergence. The GA tunes the fuzzification parameters, and the
network weights and structure simultaneously, by optimizing a single
fitness function.

1  Introduction 

Soft Computing is a consortium of methodologies which works synergistically
(not competitively) and provides, in one form or another, flexible information
processing capability for handling real life ambiguous situations [1]. There are
ongoing efforts to integrate artificial neural networks (ANNs) with fuzzy set
theory, genetic algorithms (GAs), rough set theory and other methodologies in
the soft computing paradigm [2].

Knowledge-based networks [3,4] constitute a special class of ANNs that
consider crude domain knowledge to generate the initial network architecture, which
is later refined in the presence of training data. Recently, the theory of rough
sets has been used to generate knowledge-based networks.

A  recent  trend  in  neural  network  design  for  large  scale  problems  is  to  split  the 
original  task  into  simpler  subtasks,  and  use  a  subnetwork  module  for  each  of  the 
subtasks  [5].  The  divide  and  conquer  strategy  leads  to  super-linear  speedup  in 
training.  It  has  been  shown  that  by  combining  the  output  of  several  subnetworks 
in  an  ensemble,  one  can  improve  the  generalization  ability  over  that  of  a  single 
large  network  [6]. 

In the present article an evolutionary strategy is suggested for designing a
connectionist system, integrating fuzzy sets and rough sets. The basic building
block is a Rough Fuzzy MLP [7]. The evolutionary training algorithm
generates the weight values for a parsimonious network and simultaneously tunes
the fuzzification parameters by optimizing a single fitness function. Rough set
theory is used to obtain a set of probable knowledge-based subnetworks which
form the initial population of the GA. These modules are then integrated and
evolved with a restricted mutation operator that helps preserve extracted
localized rule structures as potential solutions. The operator utilizes the
knowledge of the modular structure already evolved to achieve faster convergence.

2  Fuzzy  MLP 

The fuzzy MLP model [8] incorporates fuzziness at the input and output levels of
the MLP, and is capable of handling exact (numerical) and/or inexact (linguistic)
forms of input data. Any input feature value is described in terms of some
combination of membership values in the linguistic property sets low (L), medium
(M) and high (H). Class membership values ($\mu$) of patterns are represented at
the output layer of the fuzzy MLP. During training, the weights are updated by
backpropagating errors with respect to these membership values such that the
contribution of uncertain vectors is automatically reduced.

A single layer feedforward MLP is used. An $n$-dimensional pattern $\mathbf{F}_i =
[F_{i1}, F_{i2}, \ldots, F_{in}]$ is represented as a $3n$-dimensional vector

$$\mathbf{F}_i = [\mu_{low(F_{i1})}(\mathbf{F}_i), \ldots, \mu_{high(F_{in})}(\mathbf{F}_i)] = [y_1^0, y_2^0, \ldots, y_{3n}^0], \tag{1}$$

where the $\mu$ values indicate the membership functions of the corresponding
linguistic $\pi$-sets low, medium and high along each feature axis and $y_1^0, \ldots, y_{3n}^0$ refer
to the activations of the $3n$ neurons in the input layer.

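When the input feature is numerical, we use the $\pi$-fuzzy sets (in the one
dimensional form), with range [0,1], represented as

$$\pi(F_j; c, \lambda) = \begin{cases} 2\left(1 - \frac{\|F_j - c\|}{\lambda}\right)^2, & \text{for } \frac{\lambda}{2} \le \|F_j - c\| \le \lambda \\ 1 - 2\left(\frac{\|F_j - c\|}{\lambda}\right)^2, & \text{for } 0 \le \|F_j - c\| \le \frac{\lambda}{2} \\ 0, & \text{otherwise,} \end{cases} \tag{2}$$

where $\lambda (> 0)$ is the radius of the $\pi$-function with $c$ as the central point. Note
that features in linguistic and set forms can also be handled in this framework
[8].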
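
The $\pi$-function of eqn. (2) and the resulting $3n$-dimensional input of eqn. (1) can be sketched as below. The centres and radii of the low, medium and high sets are passed in as parameters because their choice is not specified in this section; the example values are hypothetical.

```python
def pi_membership(x, c, lam):
    """One-dimensional pi-function with centre c and radius lam > 0."""
    if lam <= 0:
        raise ValueError("radius must be positive")
    d = abs(x - c)
    if d <= lam / 2:
        return 1 - 2 * (d / lam) ** 2
    if d <= lam:
        return 2 * (1 - d / lam) ** 2
    return 0.0


def fuzzify(pattern, params):
    """Map an n-dimensional pattern to 3n memberships (low, medium, high per feature).

    params[j] is a dict {"low": (c, lam), "medium": (c, lam), "high": (c, lam)}.
    """
    memberships = []
    for x, sets in zip(pattern, params):
        for name in ("low", "medium", "high"):
            c, lam = sets[name]
            memberships.append(pi_membership(x, c, lam))
    return memberships


# Hypothetical centres and radii for one feature scaled to [0, 1].
EXAMPLE_SETS = {"low": (0.0, 0.5), "medium": (0.5, 0.5), "high": (1.0, 0.5)}
```

The two quadratic branches meet at a distance of $\lambda/2$ from the centre, where the membership is exactly 0.5.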

3  Rough  Fuzzy  MLP 

Here we describe the Rough Fuzzy MLP [7]. Let $S = \langle U, A \rangle$ be a decision
table, with $C$ and $D = \{d_1, \ldots, d_l\}$ its sets of condition and decision attributes
respectively. Divide the decision table $S = \langle U, A \rangle$ into $l$ tables $S_i = \langle U_i, A_i \rangle$,
$i = 1, \ldots, l$, corresponding to the $l$ decision attributes $d_1, \ldots, d_l$, where
$U = U_1 \cup \ldots \cup U_l$ and $A_i = C \cup \{d_i\}$.

The size of each $S_i$ ($i = 1, \ldots, l$) is first reduced with the help of a threshold
on the number of occurrences of the same pattern of attribute values. This will
be elicited in the sequel. Let the reduced decision table be denoted by $T_i$, and
$\{x_{i_1}, \ldots, x_{i_p}\}$ be the set of those objects of $U_i$ that occur in $T_i$, $i = 1, \ldots, l$.

Now for each $d_i$-reduct $B = \{b_1, \ldots, b_k\}$ (say), a discernibility matrix (denoted
$M_{d_i}(B)$) from the $d_i$-discernibility matrix is defined as follows:

$$c_{ij} = \{a \in B : a(x_i) \neq a(x_j)\}, \tag{3}$$

for $i, j = 1, \ldots, n$.

For each object $x_j \in \{x_{i_1}, \ldots, x_{i_p}\}$, the discernibility function $f_{d_i}^{x_j}$ is defined as

$$f_{d_i}^{x_j} = \bigwedge \{ \bigvee(c_{ij}) : 1 \le i, j \le n,\ j < i,\ c_{ij} \neq \emptyset \}, \tag{4}$$

where $\bigvee(c_{ij})$ is the disjunction of all members of $c_{ij}$. Then $f_{d_i}^{x_j}$ is brought to
its conjunctive normal form (c.n.f). One thus obtains a dependency rule $r_i$, viz.
$P_i \leftarrow d_i$, where $P_i$ is the disjunctive normal form (d.n.f) of $f_{d_i}^{x_j}$, $j \in \{i_1, \ldots, i_p\}$.
The dependency factor $df_i$ for $r_i$ is given by

$$df_i = \frac{card(POS_i(d_i))}{card(U_i)}, \tag{5}$$

where $POS_i(d_i) = \bigcup_{X \in I_{d_i}} l_i(X)$, and $l_i(X)$ is the lower approximation of $X$
with respect to $I_i$. In this case $df_i = 1$ [7].
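
The discernibility matrix of eqn. (3) and the clauses entering eqn. (4) can be illustrated on a small binary table. The sketch reads eqn. (4), for a fixed object $x_j$, as the conjunction over all other objects of the disjunctions $\bigvee(c_{ij})$; that reading, the attribute labels and the function names are assumptions of this example.

```python
def discernibility_matrix(objects, attributes):
    """c_ij = set of attributes on which objects i and j take different values."""
    n = len(objects)
    return [
        [{a for a in attributes if objects[i][a] != objects[j][a]} for j in range(n)]
        for i in range(n)
    ]


def discernibility_clauses(matrix, j):
    """Clauses of the discernibility function of object j.

    Each clause is a frozenset of attributes read as a disjunction; the function
    itself is the conjunction of all returned clauses. Empty entries are skipped.
    """
    return {frozenset(matrix[i][j]) for i in range(len(matrix)) if matrix[i][j]}


OBJECTS = [
    {"L1": 1, "M1": 0, "H2": 1},
    {"L1": 1, "M1": 1, "H2": 0},
    {"L1": 0, "M1": 0, "H2": 1},
]
ATTRS = ["L1", "M1", "H2"]
MATRIX = discernibility_matrix(OBJECTS, ATTRS)
```

For the first object the function reads (M1 or H2) and L1, which expands to the disjunctive form (M1 and L1) or (H2 and L1).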

Consider the case of feature $F_j$ for class $c_k$ in the $l$-class problem domain.
The inputs for the $i$th representative sample $\mathbf{F}_i$ are mapped to the corresponding
three-dimensional feature space of $\mu_{low(F_{ij})}(\mathbf{F}_i)$, $\mu_{medium(F_{ij})}(\mathbf{F}_i)$ and $\mu_{high(F_{ij})}(\mathbf{F}_i)$,
by eqn. (1). Let these be represented by $L_j$, $M_j$ and $H_j$ respectively. Then
consider only those attributes which have a numerical value greater than some
threshold $Th$ ($0.5 \le Th < 1$). This implies clamping only those features
demonstrating high membership values with one, while the others are fixed at zero. As
the method considers multiple objects in a class, a separate $n_k \times 3n$-dimensional
attribute-value decision table is generated for each class $c_k$ (where $n_k$ indicates
the number of objects in $c_k$).

Let there be $m$ sets $O_1, \ldots, O_m$ of objects in the table having identical
attribute values, and $card(O_i) = n_{k_i}$, $i = 1, \ldots, m$, such that $n_{k_1} \ge \ldots \ge n_{k_m}$ and
$\sum_{i=1}^{m} n_{k_i} = n_k$. The attribute-value table can now be represented as an $m \times 3n$
array. Let $n'_{k_1}, n'_{k_2}, \ldots, n'_{k_m}$ denote the distinct elements among $n_{k_1}, \ldots, n_{k_m}$
such that $n'_{k_1} > n'_{k_2} > \ldots > n'_{k_m}$. Let a heuristic threshold function be defined


(6) 


so that all entries having frequency less than $Tr$ are eliminated from the
table, resulting in the reduced attribute-value table. Note that the main motive
of introducing this threshold function lies in reducing the size of the resulting
network. One attempts to eliminate noisy pattern representatives (having lower
values of $n_{k_i}$) from the reduced attribute-value table.
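
The clamping and frequency-based pruning of the attribute-value table may be sketched as follows. Here the threshold $Tr$ is simply passed in as a number, since its defining expression is not reproduced in this sketch; the function names are illustrative.

```python
from collections import Counter


def binarise(memberships, th):
    """Clamp memberships greater than th to 1 and all others to 0."""
    return tuple(1 if m > th else 0 for m in memberships)


def reduce_table(patterns, th, tr):
    """Group identical binarised rows and drop those occurring fewer than tr times.

    Returns (row, frequency) pairs ordered by decreasing frequency.
    """
    counts = Counter(binarise(p, th) for p in patterns)
    return [(row, n) for row, n in counts.most_common() if n >= tr]


PATTERNS = [
    [0.9, 0.2, 0.1],
    [0.8, 0.3, 0.0],
    [0.7, 0.1, 0.2],
    [0.1, 0.9, 0.3],
]
```

With a membership threshold of 0.5 and a frequency threshold of 2, the single outlying pattern in this example is discarded.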

While designing the initial structure of the rough fuzzy MLP, the union of
the rules of the $l$ classes is considered. The input layer consists of $3n$ attribute
values while the output layer is represented by $l$ classes. The hidden layer nodes
model the first level (innermost) operator in the antecedent part of a rule, which
can be either a conjunct or a disjunct. The output layer nodes model the outer
level operands, which can again be either a conjunct or a disjunct. For each
inner level operator, corresponding to one output class (one dependency rule),
one hidden node is dedicated. Only those input attributes that appear in this
conjunct/disjunct are connected to the appropriate hidden node, which in turn
is connected to the corresponding output node. Each outer level operator is
modeled at the output layer by joining the corresponding hidden nodes. Note
that a single attribute (involving no inner level operators) is directly connected
to the appropriate output node via a hidden node, to maintain uniformity in
rule mapping.

Let the dependency factor for a particular dependency rule for class $c_k$ be
$df = \alpha = 1$ by eqn. (5). The weight between a hidden node $i$ and output
node $k$ is set at $\frac{\alpha}{fac} + \varepsilon$, where $fac$ refers to the number of outer level operands
in the antecedent of the rule and $\varepsilon$ is a small random number taken to destroy
any symmetry among the weights. Note that $fac \ge 1$ and each hidden node
is connected to only one output node. Let the initial weight so clamped at a
hidden node be denoted as $\beta$. The weight between an attribute $a_j$ (where
$a$ corresponds to low (L), medium (M) or high (H)) and hidden node $i$ is set
to $\frac{\beta}{facd} + \varepsilon$, such that $facd$ is the number of attributes connected by the
corresponding inner level operator. Again $facd \ge 1$. Thus for an $l$-class problem
domain there are at least $l$ hidden nodes. All other possible connections in the
resulting fuzzy MLP are set as small random numbers. It is to be mentioned
that the number of hidden nodes is determined from the dependency rules.

4  Modular  Knowledge-Based  Network 

It  is  believed  that  the  use  of  Modular  Neural  Network  (MNN)  enables  a  wider 
use of ANNs for large scale systems. Embedding modularity (i.e., to perform local
and  encapsulated  computation)  into  neural  networks  leads  to  many  advantages 
compared  to  the  use  of  a  single  network.  For  instance,  constraining  the  network 
connectivity  increases  its  learning  capacity  and  permits  its  application  to  large 
scale  problems  [5].  It  is  easier  to  encode  a  priori  knowledge  in  modular  neural 
networks.  In  addition,  the  number  of  network  parameters  can  be  reduced  by  using 
modularity.  This  feature  speeds  computation  and  can  improve  the  generalization 
capability  of  the  system. 

We use two phases. First an $l$-class classification problem is split into $l$
two-class problems. Let there be $l$ sets of subnetworks, with $3n$ inputs and
one output node each. Rough set theoretic concepts are used to encode domain
knowledge into each of the subnetworks, using eqns (3)-(6). The number of hidden
nodes and connectivity of the knowledge-based subnetworks is automatically
determined. A two-class problem leads to the generation of one or more crude
subnetworks, each encoding a particular decision rule. Let each of these
constitute a pool. So we obtain $m \ge l$ pools of knowledge-based modules. Each
pool $k$ is perturbed to generate a total of $n_k$ subnetworks, such that
$n_1 = \ldots = n_k = \ldots = n_m$. These pools constitute the initial
population of subnetworks, which are then evolved independently using genetic
algorithms.

At the end of training, the modules/subnetworks corresponding to each
two-class problem are concatenated to form an initial network for the second phase.
The  inter  module  links  are  initialized  to  small  random  values  as  depicted  in 
Fig.  1.  A  set  of  such  concatenated  networks  forms  the  initial  population  of  the 
GA.  The  mutation  probability  for  the  inter-module  links  is  now  set  to  a  high 
value,  while  that  of  intra-module  links  is  set  to  a  relatively  lower  value.  This  sort 
of restricted mutation helps preserve some of the localized rule structures, already
extracted  and  evolved,  as  potential  solutions.  The  initial  population  for  the  GA 
of  the  entire  network  is  formed  from  all  possible  combinations  of  these  individual 
network  modules  and  random  perturbations  about  them.  This  ensures  that  for 
complex  multi-modal  pattern  distributions  all  the  different  representative  points 
remain  in  the  population.  The  algorithm  then  searches  through  the  reduced 
space  of  possible  network  topologies. 


[Figure: three subnetwork modules (Module 1, Module 2, Module 3) with intra-module and inter-module links]
Fig. 1. Intra and Inter module links


5  Evolutionary  Design 

Genetic  algorithms  are  highly  parallel  and  adaptive  search  processes  based  on 
the  principles  of  natural  selection.  Here  we  use  GAs  for  evolving  the  weight 
values  as  well  as  the  structure  of  the  fuzzy  MLP  used  in  the  framework  of 
modular  neural  networks.  The  input  and  output  fuzzification  parameters  are 
also  tuned.  Unlike  other  theory  refinement  systems  which  train  only  the  best 
network  approximation  obtained  from  the  domain  theories,  the  initial  population 
here  consists  of  all  possible  networks  generated  from  rough  set  theoretic  rules. 
This  is  an  advantage  because  potentially  valuable  information  may  be  wasted  by 
discarding the contribution of less successful networks at the initial level itself.

Genetic algorithms involve three basic procedures: encoding of the problem
parameters in the form of binary strings, application of genetic operators like
crossover and mutation, and selection of individuals based on some objective
function to create a new population. Each of these aspects is discussed below
with relevance to our algorithm.


5.1  Chromosomal  Representation 

The problem variables consist of the weight values and the input/output
fuzzification parameters. Each of the weights is encoded into a binary word of
16 bit length, where [000...0] decodes to -128 and [111...1] decodes to 128. An
additional bit is assigned to each weight to indicate the presence or absence of
the link. If this bit is 0 the remaining bits are unrepresented in the phenotype.
The total number of bits in the string is therefore dynamic [9]. Thus a total of
17 bits are assigned for each weight. The fuzzification parameters tuned are the
centers ($c$) and radius ($\lambda$) for each of the linguistic attributes low,
medium and high of each feature (eqn. 2). These are also coded as 16 bit strings
in the range [0, 2].
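
A possible encoding of a single weight is sketched below. The text fixes only the two end points of the 16-bit word and the extra presence bit; the linear mapping between them, the position of the presence bit at the front, and the names `encode_weight` and `decode_weight` are assumptions made for illustration.

```python
WORD_BITS = 16
W_MIN, W_MAX = -128.0, 128.0
LEVELS = 2 ** WORD_BITS - 1  # number of steps between the end points


def encode_weight(w, present=True):
    """Return 17 characters: one presence bit followed by a 16-bit word."""
    w = min(max(w, W_MIN), W_MAX)  # clip to the representable range
    level = round((w - W_MIN) / (W_MAX - W_MIN) * LEVELS)
    return ("1" if present else "0") + format(level, "016b")


def decode_weight(bits):
    """Return the weight, or None when the presence bit is 0 (link absent)."""
    if bits[0] == "0":
        return None
    level = int(bits[1:], 2)
    return W_MIN + level * (W_MAX - W_MIN) / LEVELS
```

With this mapping the resolution of a weight is 256/65535, and a word whose presence bit is 0 is simply skipped when the network is decoded.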

Initial  population  is  generated  by  coding  the  networks  obtained  by  rough 
set  based  knowledge  encoding,  and  by  random  perturbations  about  them.  A 
population  size  of  64  was  considered. 


5.2  Genetic  Operators 

Crossover. It is obvious that due to the large string length, single point
crossover would have little effectiveness. Multiple point crossover is adopted,
with the distance between two crossover points being a random variable
between 8 and 24 bits. This is done to ensure a high probability for only one
crossover point occurring within a word encoding a single weight. The crossover
probability is fixed at 0.7.
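
The multiple point crossover can be sketched as follows, assuming that the gap between consecutive crossover points is drawn uniformly from 8 to 24 bits and that the two parents exchange every second segment; the function name and its parameters are hypothetical.

```python
import random


def multipoint_crossover(a, b, p_cross=0.7, gap=(8, 24), rng=None):
    """Cross two equal-length bit strings at randomly spaced points."""
    rng = rng or random.Random()
    if rng.random() >= p_cross:
        return a, b  # no crossover for this pair
    child1, child2 = [], []
    pos, swap = 0, False
    while pos < len(a):
        # the next crossover point lies 8 to 24 bits further along
        nxt = min(len(a), pos + rng.randint(*gap))
        src1, src2 = (b, a) if swap else (a, b)
        child1.append(src1[pos:nxt])
        child2.append(src2[pos:nxt])
        pos, swap = nxt, not swap
    return "".join(child1), "".join(child2)
```

Since a weight occupies 17 bits, gaps of 8 to 24 bits make it unlikely that two crossover points fall inside the same word.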


Mutation. The search string being very large, mutation has a greater influence
on the search. Each of the bits in the string is chosen to have some mutation
probability ($p_{mut}$). This mutation probability however has a spatio-temporal
variation. The maximum value of $p_{mut}$ is chosen to be 0.4 and the minimum
value as 0.01. The mutation probabilities also vary along the encoded string,
with the bits corresponding to inter-module links being assigned a probability
$p_{mut}$ (i.e., the value of $p_{mut}$ at that iteration) and intra-module links
assigned a probability $p_{mut}/10$. This is done to ensure least alterations in
the structure of the individual modules already evolved. Hence, the mutation
operator indirectly incorporates the domain knowledge extracted through rough
set theory.
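
The spatial part of this variation can be sketched as below. The sketch assumes that intra-module bits mutate at one tenth of the inter-module rate; the mask argument, the function name, and the `intra_scale` parameter are hypothetical, and the temporal schedule of $p_{mut}$ between 0.4 and 0.01 is left to the caller because its form is not given here.

```python
import random


def mutate(bits, inter_module_mask, pmut, intra_scale=0.1, rng=None):
    """Flip bits with a position-dependent probability.

    inter_module_mask[i] is True when bit i encodes an inter-module link;
    such bits flip with probability pmut, all others with pmut * intra_scale.
    """
    rng = rng or random.Random()
    out = []
    for bit, is_inter in zip(bits, inter_module_mask):
        p = pmut if is_inter else pmut * intra_scale
        if rng.random() < p:
            bit = "1" if bit == "0" else "0"
        out.append(bit)
    return "".join(out)
```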


5.3  Choice  of  fitness  function 

In GAs the fitness function is the final arbiter for string creation, and the nature
of the solution obtained depends on the objective function. An objective function
of the form described below is chosen.

$$F = \alpha_1 f_1 + \alpha_2 f_2,$$

where

$$f_1 = \frac{\text{No. of correctly classified samples in training set}}{\text{Total no. of samples in training set}}, \qquad f_2 = 1 - \frac{\text{No. of links present}}{\text{Total no. of links possible}}.$$

Here $\alpha_1$ and $\alpha_2$ determine the relative weightage of each of the
factors. $\alpha_1$ is taken to be 0.9 and $\alpha_2$ is taken as 0.1, to give
more importance to the classification score compared to the network size in
terms of number of links. Note that we optimize the network connectivity,
weights and input/output fuzzification parameters simultaneously.
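
As a small worked sketch of this objective function (the function name and the sample counts in the usage line are made up for illustration):

```python
def fitness(n_correct, n_train, n_links_present, n_links_possible,
            a1=0.9, a2=0.1):
    """Weighted sum of classification score and network sparseness."""
    f1 = n_correct / n_train                      # training accuracy
    f2 = 1.0 - n_links_present / n_links_possible  # fraction of pruned links
    return a1 * f1 + a2 * f2


# hypothetical individual: 90 of 100 samples correct, 84 of 210 links kept
score = fitness(90, 100, 84, 210)
```

For these hypothetical counts $f_1 = 0.9$ and $f_2 = 0.6$, so the accuracy term contributes 0.81 and the sparseness term 0.06.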


5.4  Selection 

Selection  is  done  by  the  roulette  wheel  method.  The  probabilities  are  calculated 
on  the  basis  of  ranking  of  the  individuals  in  terms  of  the  objective  function, 
instead of the objective function itself. Fitness ranking overcomes two of the
biggest problems inherited from traditional fitness scaling: over compression
and under expansion.

Elitism is incorporated in the selection process to prevent oscillation of the
fitness function with generation. The fitness of the best individual of a new
generation is compared with that of the current generation. If the latter has a
higher value, the corresponding individual replaces a randomly selected
individual in the new population.
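
Rank-based roulette wheel selection with elitism can be sketched as follows. The sketch assumes selection probabilities proportional to the rank (worst individual has rank 1); the text does not specify how ranks map to probabilities, and both function names are hypothetical.

```python
import random


def rank_roulette_select(fitnesses, n_select, rng=None):
    """Pick indices with probability proportional to fitness rank."""
    rng = rng or random.Random()
    order = sorted(range(len(fitnesses)), key=lambda i: fitnesses[i])
    ranks = [0] * len(fitnesses)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank  # the fittest individual gets the largest rank
    return rng.choices(range(len(fitnesses)), weights=ranks, k=n_select)


def apply_elitism(old_pop, old_fit, new_pop, new_fit, rng=None):
    """Reinsert the old best individual if the new generation is worse."""
    rng = rng or random.Random()
    best_old = max(range(len(old_pop)), key=lambda i: old_fit[i])
    if old_fit[best_old] > max(new_fit):
        slot = rng.randrange(len(new_pop))  # randomly selected victim
        new_pop[slot] = old_pop[best_old]
        new_fit[slot] = old_fit[best_old]
    return new_pop, new_fit
```

Because only ranks enter the wheel, a single very fit individual cannot dominate the selection, and nearly equal fitness values are still separated.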


6  Implementation  and  Results 

The  genetic-rough-neuro-fuzzy  algorithm  has  been  implemented  on  speech  data. 

Let  the  proposed  methodology  be  termed  Model  S.  Other  models  compared 
include: 

Model O: An ordinary MLP trained using backpropagation (BP) with weight
decay.

Model F: A fuzzy multilayer perceptron trained using BP [8] (with weight
decay).

Model R: A fuzzy multilayer perceptron trained using BP (with weight
decay), with initial knowledge encoding using rough sets [7].

Model FM: A modular fuzzy multilayer perceptron trained with GAs along
with tuning of the fuzzification parameters. Here the term modular refers to
the use of subnetworks corresponding to each class, that are later concatenated
using GAs.

A threshold value of 0.5 is applied on the fuzzified inputs to generate the
attribute value table used in rough set encoding, such that an attribute value
is 1 if the corresponding fuzzified input exceeds 0.5 and 0 otherwise. Here,
50% of the samples are used as training set and the network is tested on the
remaining samples.

The speech data Vowel deals with 871 Indian Telegu vowel sounds. These
were uttered in a consonant-vowel-consonant context by three male speakers in
the age group of 30 to 35 years. The data set has three features: $F_1$, $F_2$
and $F_3$, corresponding to the first, second and third vowel formant frequencies
obtained through spectrum analysis of the speech data. There are six classes
$\delta, a, i, u, e, o$. These overlapping classes will be denoted by
$c_1, c_2, \ldots, c_6$.

The  rough  set  theoretic  technique  is  applied  on  the  vowel  data  to  extract 
some  knowledge  which  is  initially  encoded  among  the  connection  weights  of  the 
subnetworks.  The  data  is  first  transformed  into  a  nine  dimensional  linguistic 
space. 

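The dependency rules obtained are:

$c_1 \leftarrow M_1 \vee L_3$, $c_1 \leftarrow M_1 \vee M_3$, $c_2 \leftarrow M_2 \vee M_3$, $c_3 \leftarrow (L_1 \wedge M_3) \vee (L_1 \wedge H_3)$,

$c_4 \leftarrow (L_2 \wedge M_3) \vee (L_1 \wedge L_2) \vee (L_1 \wedge L_3) \vee (L_2 \wedge L_3)$,

$c_4 \leftarrow (L_2 \wedge H_3) \vee (L_1 \wedge L_2) \vee (L_1 \wedge L_3) \vee (L_2 \wedge L_3)$, $c_5 \leftarrow (M_1 \wedge M_3) \vee (H_1 \wedge M_3)$,

$c_5 \leftarrow (H_1 \wedge M_3) \vee (H_1 \wedge H_3) \vee (M_3 \wedge H_3)$, $c_6 \leftarrow L_1 \vee M_3$,

$c_6 \leftarrow M_1 \vee M_3$, $c_6 \leftarrow L_1 \vee H_3$, $c_6 \leftarrow M_1 \vee H_3$.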

The  above  rules  are  used  to  get  initial  subnetwork  modules  using  the  scheme 
outlined  in  Section  3.  The  integrated  network  contains  a  single  hidden  layer 
with  15  nodes.  In  all,  32  such  networks  are  obtained.  The  remaining  32  networks 
are  obtained  by  small  (20%)  random  perturbations  about  them,  to  generate  an 
initial  population  of  64  individuals. 
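
The figure of 32 networks is consistent with choosing one rule per class from the rule set above, which the following sketch checks. The reading that an integrated network takes exactly one rule for each class is an assumption, and the rule strings are a plain-text rendering of the rules listed above.

```python
from itertools import product

# alternative dependency rules per class, written with | for "or", & for "and"
rules = {
    "c1": ["M1 | L3", "M1 | M3"],
    "c2": ["M2 | M3"],
    "c3": ["(L1 & M3) | (L1 & H3)"],
    "c4": ["(L2 & M3) | (L1 & L2) | (L1 & L3) | (L2 & L3)",
           "(L2 & H3) | (L1 & L2) | (L1 & L3) | (L2 & L3)"],
    "c5": ["(M1 & M3) | (H1 & M3)",
           "(H1 & M3) | (H1 & H3) | (M3 & H3)"],
    "c6": ["L1 | M3", "M1 | M3", "L1 | H3", "M1 | H3"],
}

# one network per way of picking a single rule for every class
networks = list(product(*rules.values()))
n_networks = len(networks)  # 2 * 1 * 1 * 2 * 2 * 4
```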

The  performance  of  Model  S  along  with  its  comparison  with  other  models 
using  the  same  number  of  hidden  nodes  is  presented  in  Table  1.  In  the  first  phase 
of the GA (for models FM and S), each of the subnetworks is partially trained
for 10 sweeps. It is observed that Model S performs the best with the least
network  size  after  being  trained  for  only  90  sweeps  in  the  final  phase.  Comparing 
Models  F  and  R,  we  observe  that  the  incorporation  of  domain  knowledge  in  the 
latter  through  rough  sets  boosts  its  performance.  Similarly,  using  the  modular 
approach  with  GA  in  Model  FM  improves  its  efficiency  over  that  of  Model  F. 
Note  that  Model  S  encompasses  both  models  R  and  FM.  Hence  it  results  in  the 
least  redundant  yet  effective  model. 

Table 1. Comparative performance of different models for Vowel data

Model        O      F      R      FM     S
# links      131    210    152    124    84
Iterations   5600   5600   2000   200    90

[Train and test recognition scores (%) for classes c1 to c6 and for the whole network (Net) are not legible in the source.]

7 Conclusions

A methodology for integrating rough sets with fuzzy MLP using genetic
algorithms for designing a knowledge-based network for pattern classification and
rule generation is presented. The proposed algorithm involves synthesis of
several MLP modules, each encoding the rough set rules for a particular class. These
knowledge-based modules are refined using a GA. The genetic operators are
implemented in such a way that they help preserve the modular structure already
evolved. It is seen that this methodology along with modular network
decomposition results in superior performance in terms of classification score, training
time, and network sparseness (thereby enabling easier extraction of rules) as
compared to earlier hybridizations.

Abstract. We consider approximate versions of fundamental notions of
theories of rough sets and association rules. We analyze the complexity
of searching for $\alpha$-reducts, understood as subsets discerning "$\alpha$-almost"
objects from different decision classes, in decision tables. We present how
optimal approximate association rules can be derived from data by using
heuristics for searching for minimal $\alpha$-reducts. NP-hardness of the
problem of finding optimal approximate association rules is shown as well.
This makes the results that enable the use of rough set algorithms in the
search for association rules extremely important in view of applications.


1  Introduction 

The theory of rough sets ([5]) provides efficient tools for dealing with fundamental
data mining challenges, like data representation and classification, or knowledge
description (see e.g. [2], [3], [4], [8]). Based on the notions of information system
and decision table, the language of reducts and rules was proposed for expressing
dependencies between considered features, in view of gathered information.

Given a distinguished feature, called decision, the notion of decision reduct is
constructed over the so-called discernibility matrix ([7]), where information about
all pairs of objects with different decision values is stored. A reduct is any
minimal (in the sense of inclusion) subset of non-decision features (conditions)
which discerns all such pairs that need to be considered, e.g., with respect to
proper decision classification of new cases.

In real applications, relying on such deterministic reducts, understood as
above, is often too restrictive with respect to discerning all necessary pairs.
Indeed, deterministic dependencies may require too many conditions to be
involved. Several approaches to uncertainty representation of decision rules and
reducts were proposed to weaken the above conditions (see e.g. [6], [8], [9]). In
the language of reducts and their discernibility characteristics, we can say that
such uncertainty or imprecision can be connected with a ratio of pairs from
different decision classes which remain not discerned by such an approximate
reduct.

Applications of rough sets theory to the generation of rules, for classification
of new cases or representation of data information, are usually restricted
to searching for decision rules with a fixed decision feature related to a rule's
consequence. Recently, however, more and more attention is paid to the so-called
associative mechanism of rule generation, where all attributes can occur as
involved in conditions or consequences of particular rules (compare with, e.g.,
[1], [10]). The relationship between techniques of searching for optimal
association rules and rough sets optimization tasks, like, e.g., template
generation, was studied in [3]. In this paper we would like to focus on
approximate association rules, analyzing both the complexity of related search
tasks and their correspondence to approximate reducts.

The reader may pay attention to similarities between the constructions of proofs
of complexity results concerning approximate reducts and association rules. We
believe that the presented techniques can be regarded as even more universal for
simple and intuitive characteristics of related optimization tasks. What is even
more important, however, is the correspondence between the optimization problems
concerning the above mentioned notions. Although the problems of finding
both minimal approximate reducts and all approximate reducts are NP-hard,
the existence of very efficient and fast heuristics for solving them (compare, e.g.,
with [4]) makes such a correspondence a very important tool for development of
appropriate methods of finding optimal approximate association rules.

The paper is organized as follows: In Section 2 we introduce basic notions
of rough sets theory and consider the complexity of searching for minimal
approximate (in the sense of discernibility) reducts in decision tables. In
Section 3 we introduce the notion of association rule as strongly related to the
notion of template, known from rough sets theory. Similarly as in the case of
approximate reducts, we show the NP-hardness of the problem of finding an optimal
approximate (in the sense of a confidence threshold) association rule
corresponding to a given template. In Section 4 we show how optimal approximate
association rules can be searched for as minimal approximate reducts, by using an
appropriate decision table representation. In Section 5 we conclude the paper by
pointing out directions of further research.

2 Approximate reducts

An information system is a pair $\mathbb{S} = (U, A)$, where $U$ is a non-empty,
finite set called the universe and $A$ is a non-empty, finite set of attributes.
Each $a \in A$ corresponds to a function $a : U \to V_a$, where $V_a$ is called
the value set of $a$. Elements of $U$ are called situations, objects or rows,
interpreted as, e.g., cases, states, patients, observations. We also consider a
special case of information system: a decision table
$\mathbb{S} = (U, A \cup \{d\})$, where $d \notin A$ is a distinguished attribute
called decision and the elements of $A$ are called conditional attributes
(conditions).

In a given information system, in general, we are not able to distinguish all
pairs of situations (objects) by using attributes of the system. Namely,
different situations can have the same values on considered attributes. Hence,
any set of attributes divides the universe $U$ into some classes which establish
a partition of $U$ ([5]). With any subset of attributes $B \subseteq A$ we
associate a binary relation $ind(B)$, called a $B$-indiscernibility relation,
which is defined by
$ind(B) = \{(u, u') \in U \times U : \text{for every } a \in B,\ a(u) = a(u')\}$.

Let $\mathbb{S} = (U, A)$ be an information system. Assume that
$U = \{u_1, \ldots, u_N\}$ and $A = \{a_1, \ldots, a_n\}$. By $M(\mathbb{S})$ we
denote an $N \times N$ matrix $(c_{ij})$, called the discernibility matrix of
$\mathbb{S}$, such that $c_{ij} = \{a \in A : a(u_i) \ne a(u_j)\}$ for
$i, j = 1, \ldots, N$. Discernibility matrices are useful for deriving possibly
small subsets of attributes, still keeping the knowledge encoded within a system.
Given $\mathbb{S} = (U, A)$, we call a reduct each subset $B \subseteq A$ that is
minimal in the sense of inclusion among those intersecting each non-empty
$c_{ij}$, i.e., such that

$$\forall_{i,j}\ (c_{ij} \ne \emptyset) \Rightarrow (B \cap c_{ij} \ne \emptyset)$$

The above condition states the reducts as minimal subsets of attributes which
discern all pairs of objects possible to be discerned within an information
system. In a special case, for decision table $\mathbb{S} = (U, A \cup \{d\})$,
we may weaken this condition, because not all pairs need to be discerned to keep
knowledge concerning decision $d$. We modify elements of the corresponding
discernibility matrix with respect to the formula

$$[c_{ij} = \{a \in A : a(u_i) \ne a(u_j)\} \Leftrightarrow d(u_i) \ne d(u_j)] \wedge [c_{ij} = \emptyset \Leftrightarrow d(u_i) = d(u_j)]$$

In this paper we are going to focus on decision tables, so, from now on, we will
understand reducts as corresponding to such modified matrices.
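
The modified discernibility matrix and the reduct condition can be sketched in a few lines. The representation of objects as dictionaries, the function names, and the three-object toy table are hypothetical; pairs with equal decisions are simply omitted, which has the same effect as storing an empty cell for them.

```python
from itertools import combinations


def discernibility(rows, decision):
    """Cells c_ij for all pairs of objects with different decisions."""
    cells = {}
    for i, j in combinations(range(len(rows)), 2):
        if decision[i] != decision[j]:
            cells[(i, j)] = {a for a in rows[i] if rows[i][a] != rows[j][a]}
    return cells


def hits_all(B, cells):
    """True if B intersects every non-empty cell."""
    return all(B & c for c in cells.values() if c)


def is_reduct(B, cells):
    """B hits every non-empty cell and no attribute of B can be dropped."""
    return hits_all(B, cells) and not any(hits_all(B - {a}, cells) for a in B)


# toy decision table: three objects, conditions a, b, c
rows = [{"a": 0, "b": 0, "c": 1},
        {"a": 1, "b": 0, "c": 1},
        {"a": 0, "b": 1, "c": 0}]
decision = [0, 1, 1]
cells = discernibility(rows, decision)
```

Checking single-attribute removals is enough for minimality here, because a superset of a set that hits all cells also hits all cells.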

Extracting reducts from data is a crucial task in view of tending to a possibly
clear description of decision in terms of conditional attributes. In view of the
above formulas, such a description can be regarded as deterministic, relative
to gathered information (one can show that the above definition of reduct in
a decision table is equivalent to that based on generalized decision functions,
considered, e.g., in [8]). Still, in real life applications, we often cannot
afford to handle subsets of conditions defining $d$ even in such a relative way.
Thus, in some applications (see e.g. [6]), we prefer to use
$\alpha$-approximations of reducts, where $\alpha \in (0, 1]$ is a real parameter.

We consider two versions of such approximations. The first of them is related
to the task of discerning almost all pairs of objects with different decision
classes, regardless of information provided by conditional attributes: The set of
attributes $B \subseteq A$ is called an $\alpha$-reduct iff it is minimal in the
sense of inclusion among those intersecting at least $\alpha \cdot 100\%$ of
pairs necessary to be discerned with respect to decision, which means that

$$\frac{|\{c_{ij} : B \cap c_{ij} \ne \emptyset\}|}{|\{(u_i, u_j) : d(u_i) \ne d(u_j)\}|} \ge \alpha$$

Appropriate tuning of parameter $\alpha$ in the above inequality provides a
representation of inconsistent information, alternative to approaches based on
generalized or other decision functions, proposed, e.g., in [8] or [9]. What is
similar, however, is the complexity characteristics, well known for $\alpha = 1$,
of the following problem:


Theorem 1. For a given $\alpha \in (0, 1)$, the problem of finding the minimal
(in the sense of cardinality) $\alpha$-reduct is NP-hard with respect to the
number of conditional attributes.

Because of the lack of space, let us just mention that the proof of Theorem 1
can be obtained by reducing (in polynomial time) the problem of minimal graph
covering (i.e. the problem of finding a minimal set of vertices which cover all
edges in a given graph) to that considered above. Let us illustrate this
reduction with the following example:


Example  1 

Let us consider the Minimal $\alpha$-Reduct Problem for $\alpha = 0.8$. We
illustrate the proof of Theorem 1 by the graph $G = (V, E)$ with five vertices
$V = \{v_1, v_2, v_3, v_4, v_5\}$ and six edges
$E = \{e_1, e_2, e_3, e_4, e_5, e_6\}$. First we compute $k = [\ldots] = 4$.
Hence, decision table $\mathbb{S}(G)$ consists of five conditional attributes
$\{a_{v_1}, a_{v_2}, a_{v_3}, a_{v_4}, a_{v_5}\}$, decision $a^*$ and
$(4 + 1) + 6 = 11$ objects
$\{x_1, x_2, x_3, x_4, x^*, u_{e_1}, u_{e_2}, u_{e_3}, u_{e_4}, u_{e_5}, u_{e_6}\}$.
Decision table $\mathbb{S}(G)$ constructed from the graph $G$ is presented below
(entries marked ? are illegible):

S(G)    a_v1  a_v2  a_v3  a_v4  a_v5  a*
x_1      1     1     1     1     1    1
x_2      1     1     1     1     1    1
x_3      1     1     1     1     1    1
x_4      1     1     1     1     1    1
x*       1     1     1     1     1    ?
u_e1     0     ?     ?     1     1    1
u_e2     0     1     1     0     1    1
u_e3     1     0     1     1     0    1
u_e4     1     0     1     0     1    1
u_e5     0     1     0     1     1    1
u_e6     1     1     0     1     0    1


An analogous result can be obtained for the Minimal Relative $\alpha$-Reduct
Problem, where relative $\alpha$-reducts are understood as subsets
$B \subseteq A$ being minimal in the sense of inclusion, satisfying the inequality

$$\frac{|\{c_{ij} : B \cap c_{ij} \ne \emptyset\}|}{|\{c_{ij} : c_{ij} \ne \emptyset\}|} \ge \alpha$$

In such a case, since the procedure illustrated by Example 1 is not appropriate
any more, we have to use a more sophisticated representation of a graph by a
decision table. Instead of the formal proof, again, let us just modify the previous
illustration. An appropriate modification can be seen in Example 2 below.

Although the above results may seem to reduce the possibilities of dealing
with rough set tools in an effective way, a number of random algorithms
for finding approximately optimal solutions to the mentioned problems can be
proposed. The power of heuristics possible to be implemented by using rough set
algorithmic techniques (see e.g. [4]) is worth remembering because the majority
of interesting data mining problems is known to be NP-hard as well. Thus, the
analysis of correspondence between such problems and the search for
(approximate) reducts can turn out to be very fruitful in view of many applications.

Example 2

In case of the Minimal Relative $\alpha$-Reduct Problem, $\alpha = 0.8$, graph
$G = (V, E)$ from the above Example can be translated to decision table
$\mathbb{S}'(G)$, where, comparing to $\mathbb{S}(G)$, we add four new conditions
$a_1, a_2, a_3, a_4$. One can show that from a given minimal relative
$\alpha$-reduct in $\mathbb{S}'(G)$ we can derive (in a polynomial time with
respect to the number of conditions) minimal graph covering for $G$.

[Table: decision table S'(G); its cell values are not legible in the source.]

3  Approximate  association  rules 

Association rules and their generation can be defined in many ways (see [1]). As
we mentioned in the Introduction, we are going to introduce them as related to
so-called templates.

Given an information table $\mathbb{S} = (U, A)$, by descriptors we mean the terms
of the form $(a = v)$, where $a \in A$ is an attribute and $v \in V_a$ is a value
in the domain of $a$ (see [4]). The notion of descriptor can be generalized by
using terms of the form $(a \in S)$, where $S \subseteq V_a$ is a set of values.
By a template we mean the conjunction of descriptors, i.e.
$T = D_1 \wedge D_2 \wedge \ldots \wedge D_m$, where $D_1, \ldots, D_m$ are
either simple or generalized descriptors. We denote by $length(T)$ the number of
descriptors in $T$.

An object $u \in U$ satisfies template
$T = (a_{i_1} = v_1) \wedge \ldots \wedge (a_{i_m} = v_m)$ if and only if
$\forall_j\ a_{i_j}(u) = v_j$. Hence, template $T$ describes the set of objects
having the common property: "the values of attributes $a_{i_1}, \ldots, a_{i_m}$
on these objects are equal to $v_1, \ldots, v_m$, respectively". The support of
$T$ is defined by $support(T) = |\{u \in U : u \text{ satisfies } T\}|$.

Long templates with large support are preferred in many Data Mining tasks.
Depending on the concrete optimization function, problems of finding optimal
large templates are either known to be NP-hard with respect to the number of
attributes involved in descriptors, or remain open problems (see e.g. [3]).
Nevertheless, large templates can be found quite efficiently by the Apriori and
AprioriTid algorithms (see [1], [10]). A number of other methods for large
template generation have been proposed, e.g., in [4].

According to the presented notation, association rules can be defined as
implications of the form $(P \Rightarrow Q)$, where $P$ and $Q$ are different
simple templates. Thus, they take the form

$$(a_{i_1} = v_{i_1}) \wedge \ldots \wedge (a_{i_k} = v_{i_k}) \Rightarrow (a_{j_1} = v_{j_1}) \wedge \ldots \wedge (a_{j_l} = v_{j_l}) \qquad (1)$$

Usually, for a given information system $\mathbb{S}$, the quality of association
rule $\mathcal{R} = P \Rightarrow Q$ is evaluated by two measures called support
and confidence with respect to $\mathbb{S}$. The support of rule $\mathcal{R}$ is
defined by the number of objects from $\mathbb{S}$ satisfying condition
$(P \wedge Q)$, i.e.

$$support(\mathcal{R}) = support(P \wedge Q)$$

The second measure, confidence of $\mathcal{R}$, is the ratio of support of
$(P \wedge Q)$ and support of $P$, i.e.

$$confidence(\mathcal{R}) = \frac{support(P \wedge Q)}{support(P)}$$
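
Both measures are easy to compute for simple templates. In the sketch below a template is a list of (attribute, value) descriptors and the information table is a list of dictionaries; this representation and the four-object toy table are assumptions made for illustration.

```python
def support(template, table):
    """Number of objects that satisfy every descriptor of the template."""
    return sum(all(row[a] == v for a, v in template) for row in table)


def confidence(P, Q, table):
    """support(P and Q) / support(P); requires support(P) > 0."""
    return support(P + Q, table) / support(P, table)


# toy information table with attributes a, b, c
table = [{"a": 1, "b": 1, "c": 0},
         {"a": 1, "b": 1, "c": 1},
         {"a": 1, "b": 0, "c": 1},
         {"a": 0, "b": 1, "c": 1}]
P = [("a", 1)]
Q = [("b", 1)]
```

In this toy table three objects satisfy $P$ and two of them also satisfy $Q$, so the rule $P \Rightarrow Q$ has support 2 and confidence 2/3.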


The  following  problem  has  been  investigated  by  many  authors  (see  e.g.  [1],  [10]): 


For a given information table $\mathbb{S}$, an integer $s$, and a real number
$c \in (0, 1)$, find as many as possible association rules
$\mathcal{R} = (P \Rightarrow Q)$ such that $support(\mathcal{R}) \ge s$ and
$confidence(\mathcal{R}) \ge c$.


All existing association rule generation methods consist of two main steps:

1. Generate as many as possible templates
$T = D_1 \wedge D_2 \wedge \ldots \wedge D_m$ such that $support(T) \ge s$ and
$support(T \wedge D) < s$ for any descriptor $D$ (i.e. maximal templates among
those which are supported by not less than $s$ objects).

2. For any template $T$, search for a decomposition $T = P \wedge Q$ such that:

(a) $support(P) \le \frac{support(T)}{c}$,

(b) $P$ is the smallest template satisfying (a).

In this paper we show that the second step above can be solved using rough set
methods. Let us assume that a template
$T = D_1 \wedge D_2 \wedge \ldots \wedge D_m$, which is supported by at least $s$
objects, has been found. For a given confidence threshold $c \in (0, 1]$,
decomposition $T = P \wedge Q$ is called $c$-irreducible if
$confidence(P \Rightarrow Q) \ge c$ and for any decomposition
$T = P' \wedge Q'$ such that $P'$ is a sub-template of $P$,
$confidence(P' \Rightarrow Q') < c$.
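
A brute-force search for a shortest left-hand side illustrates the definition. This is only a sketch on the same hypothetical representation (descriptors as (attribute, value) pairs, objects as dictionaries), it is not the rough set method of this paper, and its running time grows exponentially with the length of $T$. A shortest $P$ with sufficient confidence is $c$-irreducible, since every proper sub-template of it is shorter and was already rejected.

```python
from itertools import combinations


def support(template, table):
    """Number of objects that satisfy every descriptor of the template."""
    return sum(all(row[a] == v for a, v in template) for row in table)


def shortest_rule(T, table, c):
    """Return (P, Q) with T = P and Q, confidence >= c and P shortest."""
    sup_T = support(T, table)
    for size in range(1, len(T)):  # Q must keep at least one descriptor
        for P in combinations(T, size):
            # confidence(P => Q) = support(T) / support(P) >= c
            if sup_T >= c * support(P, table):
                Q = tuple(d for d in T if d not in P)
                return P, Q
    return None


table = [{"a": 1, "b": 1, "c": 0},
         {"a": 1, "b": 1, "c": 1},
         {"a": 1, "b": 0, "c": 1},
         {"a": 0, "b": 1, "c": 1}]
T = [("a", 1), ("b", 1), ("c", 1)]
rule = shortest_rule(T, table, 0.5)
```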

We are especially interested in approximate association rules, corresponding
to $c < 1$. The following theorem is the analogue, for this case, of a well-known
result concerning the search for deterministic association rules.

Theorem 2. For a fixed $c \in (0,1)$, the problem of searching for the shortest
association rule from the template $T$ for a given table $S$ with confidence limited
by $c$ (Optimal c-Association Rule Problem) is NP-hard with respect to the length
of $T$.

The proof of this theorem is similar to that of Theorem 1. We illustrate it with an
example.

Example  3 

Let us consider the Optimal c-Association Rule Problem for $c = 0.8$. We illustrate
the proof of Theorem 2 analogously to Example 1, which illustrated the proof of
Theorem 1. For graph $G = (V, E)$, we compute $k = 4$ as before. The only difference
is that, instead of the decision table $S(G)$, we now consider an information system
built from $G$ in which $a^*$ is a non-decision attribute, like the others.
4  Application of approximate reducts to the search of approximate association rules


In this section we show that the problem of searching for optimal approximate
association rules from a given template is equivalent to the problem of searching
for minimal $\alpha$-reducts in an appropriately modified decision table.
We construct a new decision table $S|_T = (U, A|_T \cup \{d\})$ from the original
information table $S$ and template $T$ as follows:


1.  $A|_T = \{a_{D_1}, a_{D_2}, \ldots, a_{D_m}\}$ is the set of attributes corresponding to the
descriptors of $T$, such that $a_{D_i}(u) = 1$ if object $u$ satisfies $D_i$, and
$a_{D_i}(u) = 0$ otherwise.

2.  Decision attribute $d$ determines whether a given object satisfies template $T$,
i.e. $d(u) = 1$ if object $u$ satisfies $T$, and $d(u) = 0$ otherwise.
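
The construction can be sketched as follows. The toy table, the list-of-pairs encoding of the template, and the helper names are assumptions made for illustration. The second helper relies on the usual decision-relative discernibility matrix: since every object with d = 1 has value 1 on all descriptor attributes, each object with d = 0 contributes one entry, namely the set of descriptors it violates.

```python
def restrict_to_template(table, template):
    """Rows of the derived decision table as (flags, d).

    flags[i] is 1 if the object satisfies the i-th descriptor, else 0;
    d is 1 if the object satisfies the whole template, else 0.
    """
    rows = []
    for obj in table:
        flags = [1 if obj.get(attr) == value else 0 for attr, value in template]
        rows.append((flags, 1 if all(flags) else 0))
    return rows


def discernibility_entries(rows):
    """One entry per object with d = 0: indices (from 1) of violated descriptors."""
    return [
        {i for i, flag in enumerate(flags, start=1) if flag == 0}
        for flags, d in rows
        if d == 0
    ]


# Hypothetical toy table and template (not the paper's 18-object example).
table = [
    {"a1": 0, "a3": 2},
    {"a1": 0, "a3": 1},
    {"a1": 1, "a3": 2},
    {"a1": 0, "a3": 2},
]
template = [("a1", 0), ("a3", 2)]
rows = restrict_to_template(table, template)
entries = discernibility_entries(rows)
```

In this toy case two objects satisfy the template, and the two remaining objects yield the entries {2} and {1}.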


The  following  theorem  describes  the  relationship  between  the  optimal  association 
rule  and  the  minimal  reduct  search  problems. 


Theorem 3. For information table $S = (U, A)$, template $T$, set of descriptors $P$
and parameter $c \in (0,1]$, the implication
$\bigwedge_{D_i \in P} D_i \Rightarrow \bigwedge_{D_j \notin P} D_j$ is a c-irreducible
association rule from $T$ iff $P$ is an $\alpha$-reduct in $S|_T$, for
$\alpha = 1 - \left(\frac{1}{c} - 1\right) / \left(\frac{N}{s} - 1\right)$,
where $N$ is the total number of objects in $U$ and $s = support(T)$. In particular,
the above implication is a deterministic association rule iff $P$ is a reduct in $S|_T$.

The  following  example  illustrates  the  main  idea  of  the  method  based  on  the 
above  characteristics.  Let  us  consider  the  following  information  table  S  with  18 
objects  and  9  attributes: 


[Table: information table S (18 objects, 9 attributes)]
Assume that template

$T = (a_1 = 0) \wedge (a_3 = 2) \wedge (a_4 = 1) \wedge (a_6 = 0) \wedge (a_8 = 1)$

has been extracted from information table $S$. One can see that $support(T) = 10$
and $length(T) = 5$. Decision table $S|_T$ is presented below:


The discernibility function $f$ corresponding to matrix $M(S|_T)$ is the following:

$f = (D_2 \vee D_4 \vee D_5) \wedge (D_1 \vee D_3 \vee D_4) \wedge (D_2 \vee D_3 \vee D_4) \wedge (D_1 \vee D_2 \vee D_3 \vee D_4)$
$\wedge (D_1 \vee D_3 \vee D_5) \wedge (D_2 \vee D_3 \vee D_5) \wedge (D_3 \vee D_4 \vee D_5) \wedge (D_1 \vee D_5)$

After simplification we obtain six reducts:
$f = (D_3 \wedge D_5) \vee (D_4 \wedge D_5) \vee (D_1 \wedge D_2 \wedge D_3) \vee (D_1 \wedge D_2 \wedge D_4) \vee (D_1 \wedge D_2 \wedge D_5) \vee (D_1 \wedge D_3 \wedge D_4)$
for decision table $S|_T$. Thus, we have found from $T$ six deterministic association
rules with full confidence.

For $c = 0.9$, we would like to find $\alpha$-reducts for decision table $S|_T$, where
$\alpha = 1 - \left(\frac{1}{0.9} - 1\right) / \left(\frac{18}{10} - 1\right) \approx 0.86$.
Hence we search for a set of descriptors covering at least
$\lceil (N - s) \cdot \alpha \rceil = \lceil 8 \cdot 0.86 \rceil = 7$ elements of discernibility matrix $M(S|_T)$.
One can see that each of the following sets $\{D_1, D_2\}$, $\{D_1, D_3\}$, $\{D_1, D_4\}$, $\{D_1, D_5\}$,
$\{D_2, D_3\}$, $\{D_2, D_5\}$, $\{D_3, D_4\}$ intersects with exactly 7 members of discernibility
matrix $M(S|_T)$. In the table above we present all association rules obtained from
these sets.
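
The numbers in this example can be checked by brute force. The sketch below is an illustration only: it encodes the eight clauses of the discernibility function as sets of descriptor indices and enumerates, in order of increasing size, the minimal descriptor sets that intersect at least a given number of clauses. The function names are assumptions, and exhaustive enumeration is used here only because the example has five descriptors.

```python
from itertools import combinations
from math import ceil

# Clauses of the discernibility function, as sets of descriptor indices.
CLAUSES = [{2, 4, 5}, {1, 3, 4}, {2, 3, 4}, {1, 2, 3, 4},
           {1, 3, 5}, {2, 3, 5}, {3, 4, 5}, {1, 5}]
DESCRIPTORS = range(1, 6)


def hits(subset):
    """Number of clauses that share at least one descriptor with `subset`."""
    return sum(1 for clause in CLAUSES if clause & subset)


def minimal_sets(threshold):
    """Minimal descriptor sets intersecting at least `threshold` clauses."""
    found = []
    for size in range(1, len(DESCRIPTORS) + 1):
        for combo in combinations(DESCRIPTORS, size):
            candidate = set(combo)
            # Smaller solutions were found earlier, so a candidate is minimal
            # exactly when it has no already-found proper subset.
            if hits(candidate) >= threshold and not any(f < candidate for f in found):
                found.append(candidate)
    return found


def alpha_for(c, n_objects, s):
    """alpha = 1 - (1/c - 1) / (N/s - 1), as in Theorem 3."""
    return 1 - (1 / c - 1) / (n_objects / s - 1)


reducts = minimal_sets(len(CLAUSES))
alpha = alpha_for(0.9, 18, 10)
needed = ceil(len(CLAUSES) * alpha)
approximate = minimal_sets(needed)
exactly_needed = [s for s in approximate if hits(s) == needed]
```

With threshold 8 the enumeration returns the six reducts listed above. With threshold 7 it returns nine pairs: the seven pairs that intersect exactly 7 clauses, plus the two-element reducts $\{D_3, D_5\}$ and $\{D_4, D_5\}$, which intersect all 8.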

The problems of finding both minimal $\alpha$-reducts and all $\alpha$-reducts are NP-hard,
so we usually cannot afford exhaustive computations such as those
presented above. However, one should remember that the above is just an
illustrative example and in real-life applications we can use very efficient and fast
heuristics for solving $\alpha$-reduct problems (see e.g. [4] for further references). In
particular, this makes the presented derivation an important tool for the development of
appropriate methods of finding optimal approximate association rules.
5  Conclusions

Searching for minimal $\alpha$-reducts is a well-known problem in rough set theory.
Great effort has been devoted to solving it. One can find numerous
applications of $\alpha$-reducts in the knowledge discovery domain. In this paper we
have shown that the problem of searching for the shortest $\alpha$-reducts is NP-hard.
We also investigated the application of $\alpha$-reducts to association rule generation.
Still, further development of the language of association rules is needed for
applications. In forthcoming papers we are going to present such a development, together
with new, rough set based algorithms for association rule generation.
Abstract. Rough sets proved to be very useful for analysis of decision
problems  concerning  objects  described  in  a  data  table  by  a  set  of  condition 
attributes  and  by  a  set  of  decision  attributes.  In  practical  applications,  however, 
the  data  table  is  often  not  complete  because  some  data  are  missing.  To  deal  with 
this  case,  we  propose  an  extension  of  the  rough  set  methodology.  The 
adaptation  concerns  both  the  classical  rough  set  approach  based  on 
indiscernibility  relations  and  the  new  rough  set  approach  based  on  dominance 
relations.  While  the  first  approach  deals  with  multi-attribute  classification 
problems,  the  second  approach  deals  with  multi-criteria  sorting  problems.  The 
adapted  relations  of  indiscernibility  or  dominance  between  two  objects  are 
considered  as  directional  statements  where  a  subject  is  compared  to  a  referent 
object  having  no  missing  values.  The  two  rough  set  approaches  handling  the 
missing  values  boil  down  to  the  original  approaches  when  the  data  table  is 
complete. The rules induced from the rough approximations are robust in the
sense that each rule is supported by at least one object with no missing values
on condition attributes or criteria used by the rule.


1  Introduction 

The  rough  sets  philosophy  introduced  by  Pawlak  [5,  6]  is  based  on  the  assumption 
that  with  every  object  of  the  universe  there  is  associated  a  certain  amount  of 
information  (data,  knowledge),  expressed  by  means  of  some  attributes  used  for  object 
description.  It  proved  to  be  an  excellent  tool  for  analysis  of  decision  problems  [7,  10] 
where  the  set  of  attributes  is  divided  into  disjoint  sets  of  condition  attributes  and 
decision  attributes  describing  objects  in  a  data  table. 

The  key  idea  of  rough  sets  is  approximation  of  knowledge  expressed  by  decision 
attributes  using  knowledge  expressed  by  condition  attributes.  The  rough  set  approach 
answers  several  questions  related  to  the  approximation:  (a)  is  the  information 
contained  in  the  data  table  consistent?  (b)  what  are  the  non-redundant  subsets  of 
condition  attributes  ensuring  the  same  quality  of  approximation  as  the  whole  set  of 
condition  attributes?  (c)  what  are  the  condition  attributes  which  cannot  be  eliminated 
from  the  approximation  without  decreasing  the  quality  of  approximation?  (d)  what 
minimal "if ..., then ..." decision rules can be induced from the approximations?
The original rough set approach is not able, however, to discover and process
inconsistencies coming from consideration of criteria, i.e. condition attributes with
preference-ordered scales. For this reason, Greco, Matarazzo and Slowinski [1, 2]
have proposed a new rough set approach that is able to deal with inconsistencies
typical of the consideration of criteria and preference-ordered decision classes. This
innovation is mainly based on substitution of the indiscernibility relation by a
dominance relation in the rough approximation of decision classes. An important
consequence of this fact is the possibility of inferring from exemplary decisions the
preference model in terms of decision rules, which are logical statements of the type
"if ..., then ...". The separation of certain and doubtful knowledge about the decision
maker's preferences is done by distinguishing different kinds of decision rules,
depending on whether they are induced from lower approximations of decision classes
or from the boundaries of these classes, composed of inconsistent examples that do
not observe the dominance principle. Such a preference model is more general than the
classical functional or relational model in multi-criteria decision making and it is
more understandable for the users because of its natural syntax.

Both the classical rough set approach based on the use of indiscernibility relations
and  the  new  rough  set  approach  based  on  the  use  of  dominance  relations  suffer, 
however,  from  another  deficiency:  they  require  the  data  table  to  be  complete,  i.e. 
without  missing  values  on  condition  attributes  or  criteria  describing  the  objects. 

To  deal  with  the  case  of  missing  values  in  the  data  table,  we  propose  an  adaptation 
of  the  rough  set  methodology.  The  adaptation  concerns  both  the  classical  rough  set 
approach  and  the  dominance-based  rough  set  approach.  While  the  first  approach  deals 
with  multi-attribute  classification,  the  second  approach  deals  with  multi-criteria 
sorting.  Multi-attribute  classification  concerns  an  assignment  of  a  set  of  objects  to  a 
set  of  pre-defined  classes.  The  objects  are  described  by  a  set  of  (regular)  attributes  and 
the  classes  are  not  necessarily  ordered.  Multi-criteria  sorting  concerns  a  set  of  objects 
evaluated  by  criteria,  i.e.  attributes  with  preference-ordered  scales.  In  this  problem, 
the  classes  are  also  preference-ordered. 

The adapted relations of indiscernibility or dominance between two objects are
considered as directional statements where a subject is compared to a referent object.
We require that the referent object has no missing values. The two adapted rough set
approaches maintain all good characteristics of their original approaches. They also
boil down to the original approaches when there are no missing values. The rules
induced from the rough approximations defined according to the adapted relations
satisfy some desirable properties: they are either exact or approximate, depending
on whether they are supported by consistent objects or not, and they are robust in the
sense that each rule is supported by at least one object with no missing value on the
condition attributes or criteria represented in the rule.

The paper is organized in the following way. In section 2, we present the extended
rough sets methodology handling the missing values. It is composed of four
sub-sections: the first two are devoted to the adaptation of the classical rough set
approach based on the use of indiscernibility relations; the other two undertake the
adaptation of the new rough set approach based on the use of dominance relations.
The concepts introduced in section 2 are illustrated by an example in section 3.
Section 4 presents conclusions.
2  Rough approximations defined on data tables with missing values

For  algorithmic  reasons,  the  data  set  about  objects  is  represented  in  the  form  of  a  data 
table.  The  rows  of  the  table  are  labelled  by  objects,  whereas  columns  are  labelled  by 
attributes  and  entries  of  the  table  are  attribute-values,  called  descriptors. 

Formally, by a data table we understand the 4-tuple $S = \langle U, Q, V, f \rangle$, where $U$ is a
finite set of objects, $Q$ is a finite set of attributes, $V = \bigcup_{q \in Q} V_q$ and $V_q$ is a
domain of the attribute $q$, and $f: U \times Q \to V$ is a total function such that
$f(x,q) \in V_q \cup \{*\}$ for every $q \in Q$, $x \in U$, called an information function. The
symbol "*" indicates that the value of an attribute for a given object is unknown (missing).

If set $Q$ is divided into set $C$ of condition attributes and set $D$ of decision
attributes, then such a data table is called a decision table. If the domain (scale) of a
condition attribute is ordered according to a decreasing or increasing preference, then
it is a criterion. For condition attribute $q \in C$ being a criterion, $S_q$ is an outranking
relation [8] on $U$ such that $x S_q y$ means "x is at least as good as y with respect to
criterion q". We suppose that $S_q$ is a total preorder, i.e. a strongly complete and
transitive binary relation, defined on $U$ on the basis of evaluations $f(\cdot,q)$. The
domains of "regular" condition attributes are not ordered.

We assume that the set $D$ of decision attributes is a singleton $\{d\}$. Decision
attribute $d$ makes a partition of $U$ into a finite number of classes $Cl = \{Cl_t, t \in T\}$,
$T = \{1, \ldots, n\}$, such that each $x \in U$ belongs to one and only one $Cl_t \in Cl$. The domain of
$d$ can be preference-ordered or not. In the former case, we suppose that the classes
are ordered such that the higher the class number, the better the class, i.e. for all
$r, s \in T$ such that $r > s$, the objects from $Cl_r$ are preferred (strictly or weakly) to the
objects from $Cl_s$. More formally, if $S$ is a comprehensive outranking relation on $U$,
i.e. if for all $x, y \in U$, $xSy$ means "x is at least as good as y", we suppose:
$[x \in Cl_r, y \in Cl_s, r > s] \Rightarrow [xSy \text{ and not } ySx]$. These assumptions are typical for
consideration of a multi-criteria sorting problem.

In the following sub-sections we consider separately the multi-attribute
classification and the multi-criteria sorting with respect to the problem
of missing values. The idea of dealing with missing values in the rough set
approach to the multi-attribute classification problem in the way described below
was first given in [3].


2.1  Multi-attribute  classification  problem  with  missing  values 

For any two objects $x, y \in U$, we consider a directional comparison of $y$ to $x$;
object $y$ is called the subject and object $x$ the referent. We say that subject $y$ is
indiscernible with referent $x$ with respect to condition attributes $P \subseteq C$ (denoted
$y I_P x$) if for every $q \in P$ the following conditions are met:

-  $f(x,q) \neq *$,

-  $f(x,q) = f(y,q)$ or $f(y,q) = *$.

The above means that the referent object considered for indiscernibility with
respect to $P$ should have no missing values on attributes from set $P$.
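
A minimal sketch of this directional relation is given below. The dictionary layout of the table, the use of the string "*" for a missing value, and the function name `indiscernible` are assumptions made for illustration.

```python
MISSING = "*"


def indiscernible(table, y, x, attrs):
    """True if subject y is indiscernible with referent x on `attrs`.

    The referent x must have no missing value on `attrs`; the subject y
    must either match x or be missing on each attribute.
    """
    return all(
        table[x][q] != MISSING
        and (table[y][q] == table[x][q] or table[y][q] == MISSING)
        for q in attrs
    )


# Hypothetical toy data table.
table = {
    "x1": {"q1": 1, "q2": 2},
    "x2": {"q1": 1, "q2": MISSING},
    "x3": {"q1": 0, "q2": 2},
}
P = ["q1", "q2"]
forward = indiscernible(table, "x2", "x1", P)   # referent x1 is complete
backward = indiscernible(table, "x1", "x2", P)  # referent x2 has a missing value
```

The two calls differ only in which object plays the referent, which is what makes the comparison directional.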
The binary relation $I_P$ is not necessarily reflexive because for some $x \in U$ there may
exist $q \in P$ for which $f(x,q) = *$ and, therefore, we cannot state $x I_P x$. Moreover, $I_P$ is also
not necessarily symmetric because the statement $y I_P x$ cannot be inverted if there exists
$q \in P$ for which $f(y,q) = *$. However, $I_P$ is transitive because for each $x, y, z \in U$, the
statements $x I_P y$ and $y I_P z$ imply $x I_P z$. This is justified by the observation that object $z$
can substitute object $y$ in the statement $x I_P y$ because $y I_P z$ and both $y$ and $z$, as referent
objects, have no missing values.

For each $P \subseteq C$ let us define a set of objects having no missing values on $q \in P$:

$U_P = \{x \in U: f(x,q) \neq * \text{ for each } q \in P\}$.

It is easy to see that the restriction of $I_P$ to $U_P$ (in other words, the binary relation
$I_P \cap (U_P \times U_P)$ defined on $U_P$) is reflexive, symmetric and transitive, i.e. it is an
equivalence relation.

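For each $x \in U$ and for each $P \subseteq Q$ let $I_P(x) = \{y \in U: y I_P x\}$ denote the class of objects
indiscernible with $x$. Given $X \subseteq U$ and $P \subseteq Q$, we define the lower approximation of $X$ with
respect to $P$ as

$\underline{I}_P(X) = \{x \in U_P: I_P(x) \subseteq X\}$.  (1)

Analogously, we define the upper approximation of $X$ with respect to $P$ as

$\overline{I}_P(X) = \{x \in U_P: I_P(x) \cap X \neq \emptyset\}$.  (2)

Let us observe that if $x \notin U_P$ then $I_P(x) = \emptyset$ and, therefore, we can also write
$\overline{I}_P(X) = \{x \in U: I_P(x) \cap X \neq \emptyset\}$.

Let $X_P = X \cap U_P$. For each $X \subseteq U$ and for each $P \subseteq C$:
$\underline{I}_P(X) \subseteq X_P \subseteq \overline{I}_P(X)$ (rough inclusion)
and $\underline{I}_P(X) = U_P - \overline{I}_P(U - X)$ (complementarity).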
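
Definitions (1) and (2) can be sketched directly on a small table. The toy data, the "*" marker for missing values, and the helper names are assumptions made here; the sketch also checks rough inclusion and complementarity on this single example, which illustrates the properties but does not prove them.

```python
MISSING = "*"

# Hypothetical toy data table: object -> attribute -> value.
TABLE = {
    "x1": {"q1": 1, "q2": 2},
    "x2": {"q1": 1, "q2": MISSING},
    "x3": {"q1": 0, "q2": 2},
    "x4": {"q1": 0, "q2": 2},
}


def complete_objects(attrs):
    """U_P: objects with no missing value on `attrs`."""
    return {x for x in TABLE if all(TABLE[x][q] != MISSING for q in attrs)}


def granule(x, attrs):
    """I_P(x): subjects y indiscernible with a complete referent x."""
    return {y for y in TABLE
            if all(TABLE[y][q] in (TABLE[x][q], MISSING) for q in attrs)}


def lower(X, attrs):
    """Definition (1): referents in U_P whose granule lies inside X."""
    return {x for x in complete_objects(attrs) if granule(x, attrs) <= X}


def upper(X, attrs):
    """Definition (2): referents in U_P whose granule meets X."""
    return {x for x in complete_objects(attrs) if granule(x, attrs) & X}


P = ["q1", "q2"]
X = {"x1", "x2", "x3"}
U_P = complete_objects(P)
low, up = lower(X, P), upper(X, P)
boundary = up - low
```

Here `x2` is excluded from both approximations because it has a missing value on `q2`, and the boundary consists of `x3` and `x4`, which are indiscernible but only one of them belongs to `X`.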

The P-boundary of $X$ in $S$, denoted by $Bn_P(X)$, is equal to $Bn_P(X) = \overline{I}_P(X) - \underline{I}_P(X)$.

$Bn_P(X)$ constitutes the "doubtful region" of $X$: according to knowledge expressed
by $P$ nothing can be said with certainty about membership of its elements in the set $X$.

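The following concept will also be useful [9]. Given a partition $Cl = \{Cl_t, t \in T\}$,
$T = \{1, \ldots, n\}$, of $U$, the P-boundary with respect to $k > 1$ classes
$\{Cl_{t_1}, \ldots, Cl_{t_k}\} \subseteq \{Cl_1, \ldots, Cl_n\}$ is defined as

$Bd_P(\{Cl_{t_1}, \ldots, Cl_{t_k}\}) = \left( \bigcap_{t = t_1, \ldots, t_k} Bn_P(Cl_t) \right) \cap \left( \bigcap_{t \neq t_1, \ldots, t_k} (U - Bn_P(Cl_t)) \right)$.

The objects from $Bd_P(\{Cl_{t_1}, \ldots, Cl_{t_k}\})$ can be assigned to one of the classes
$Cl_{t_1}, \ldots, Cl_{t_k}$, but $P$ does not provide enough information to know exactly to which class.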

Let us observe that a very useful property of lower approximation within classical
rough sets theory is that if an object $x \in U$ belongs to the lower approximation with
respect to $P \subseteq C$, then $x$ belongs also to the lower approximation with respect to $R \subseteq C$
when $P \subseteq R$ (this is a kind of monotonicity property). However, definition (1) does not
satisfy this property of lower approximation, because it is possible that $f(x,q) \neq *$ for all
$q \in P$ but $f(x,q) = *$ for some $q \in R - P$. This is quite problematic with respect to the definition
of some key concepts of the rough sets theory, like quality of approximation, reduct
and core.

Therefore, another definition of lower approximation should be considered to
restore the concepts of quality of approximation, reduct and core in the case of
missing values. Given $X \subseteq U$ and $P \subseteq Q$, this definition is the following:

$\underline{I}^*_P(X) = \bigcup_{R \subseteq P} \underline{I}_R(X)$.  (3)

$\underline{I}^*_P(X)$ will be called the cumulative P-lower approximation of $X$ because it includes
all the objects belonging to all R-lower approximations of $X$, where $R \subseteq P$.

It can be shown that another type of indiscernibility relation, denoted by $I^*_P$,
permits a direct definition of the cumulative P-lower approximation in a classic way.
For each $x, y \in U$ and for each $P \subseteq Q$, $y I^*_P x$ means that $f(x,q) = f(y,q)$ or $f(x,q) = *$ and/or
$f(y,q) = *$, for every $q \in P$. Let $I^*_P(x) = \{y \in U: y I^*_P x\}$ for each $x \in U$ and for each $P \subseteq Q$.
$I^*_P$ is reflexive and symmetric but not transitive [4]. Let us observe that the
restriction of $I^*_P$ to $U^*_P$ is reflexive, symmetric and transitive when
$U^*_P = \{x \in U: f(x,q) \neq * \text{ for at least one } q \in P\}$.

Theorem 1. (Definition (3) expressed in terms of $I^*_P$) $\underline{I}^*_P(X) = \{x \in U^*_P: I^*_P(x) \subseteq X\}$.

Using $I^*_P$ we can give the definition of the P-upper approximation of $X$:

$\overline{I}^*_P(X) = \{x \in U^*_P: I^*_P(x) \cap X \neq \emptyset\}$.  (4)
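
The equivalence stated in Theorem 1 can be checked numerically on a small table. The sketch below is an illustration under stated assumptions: the toy data and helper names are invented here, and the union in definition (3) is taken over non-empty subsets R of P. It compares the cumulative lower approximation with the direct formula for every subset X of the toy universe.

```python
from itertools import combinations

MISSING = "*"

# Hypothetical toy data table: object -> attribute -> value.
TABLE = {
    "x1": {"q1": 1, "q2": 2},
    "x2": {"q1": 1, "q2": MISSING},
    "x3": {"q1": 0, "q2": 2},
    "x4": {"q1": 0, "q2": 2},
}


def nonempty_subsets(items):
    """All non-empty subsets of `items`, as tuples."""
    for size in range(1, len(items) + 1):
        yield from combinations(items, size)


def lower(X, attrs):
    """Definition (1): complete referents whose granule lies inside X."""
    result = set()
    for x in TABLE:
        if any(TABLE[x][q] == MISSING for q in attrs):
            continue  # x is not a valid referent on attrs
        granule = {y for y in TABLE
                   if all(TABLE[y][q] in (TABLE[x][q], MISSING) for q in attrs)}
        if granule <= X:
            result.add(x)
    return result


def cumulative_lower(X, attrs):
    """Definition (3): union of R-lower approximations over non-empty R."""
    result = set()
    for R in nonempty_subsets(attrs):
        result |= lower(X, R)
    return result


def tolerant(y, x, attrs):
    """Symmetric relation: values agree or at least one of them is missing."""
    return all(TABLE[x][q] == TABLE[y][q] or MISSING in (TABLE[x][q], TABLE[y][q])
               for q in attrs)


def direct_lower(X, attrs):
    """Right-hand side of Theorem 1."""
    result = set()
    for x in TABLE:
        if all(TABLE[x][q] == MISSING for q in attrs):
            continue  # x has no known value on attrs
        if {y for y in TABLE if tolerant(y, x, attrs)} <= X:
            result.add(x)
    return result


P = ("q1", "q2")
X = {"x1", "x2", "x3"}
plain = lower(X, P)
cumulative = cumulative_lower(X, P)
agree_everywhere = all(
    cumulative_lower(set(S), P) == direct_lower(set(S), P)
    for size in range(len(TABLE) + 1)
    for S in combinations(TABLE, size)
)
```

On this table the plain P-lower approximation of X contains only `x1`, while the cumulative one also contains `x2`, which is certain on `q1` alone.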

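For each $X \subseteq U$, let $X^*_P = X \cap U^*_P$. Let us remark that $x \in U^*_P$ if and only if there
exists $R \neq \emptyset$ such that $R \subseteq P$ and $x \in U_R$. For each $X \subseteq U$ and for each $P \subseteq C$:

$\underline{I}^*_P(X) \subseteq X^*_P \subseteq \overline{I}^*_P(X)$ (rough inclusion) and $\underline{I}^*_P(X) = U^*_P - \overline{I}^*_P(U - X)$ (complementarity).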

The P-boundary of $X$ approximated with $I^*_P$ is equal to $Bn^*_P(X) = \overline{I}^*_P(X) - \underline{I}^*_P(X)$.
Given a partition $Cl = \{Cl_t, t \in T\}$, $T = \{1, \ldots, n\}$, of $U$, the P-boundary with respect to
$k > 1$ classes $\{Cl_{t_1}, \ldots, Cl_{t_k}\} \subseteq \{Cl_1, \ldots, Cl_n\}$ is defined as

$Bd^*_P(\{Cl_{t_1}, \ldots, Cl_{t_k}\}) = \left( \bigcap_{t = t_1, \ldots, t_k} Bn^*_P(Cl_t) \right) \cap \left( \bigcap_{t \neq t_1, \ldots, t_k} (U - Bn^*_P(Cl_t)) \right)$.

The objects from $Bd^*_P(\{Cl_{t_1}, \ldots, Cl_{t_k}\})$ can be assigned to one of the classes
$Cl_{t_1}, \ldots, Cl_{t_k}$; however, $P$ and all its subsets do not provide enough information to know
exactly to which class.

Theorem 2. (Monotonicity of the accuracy of approximation) For each $X \subseteq U$ and
for each $P, T \subseteq C$ such that $P \subseteq T$, the following inclusion holds: i) $\underline{I}^*_P(X) \subseteq \underline{I}^*_T(X)$.

Furthermore, if $U^*_P = U^*_T$, the following inclusion is also true: ii) $\overline{I}^*_T(X) \subseteq \overline{I}^*_P(X)$.

Due  to  Theorem  2,  when  augmenting  a  set  of  attributes  P,  we  get  a  lower 
approximation  of  X  that  is  at  least  of  the  same  cardinality.  Thus,  we  can  restore  for 
the  case  of  missing  values  the  key  concepts  of  the  rough  sets  theory:  accuracy  and 
quality  of  approximation,  reduct  and  core. 

2.2  Decision  rules  for  multi-attribute  classification  with  missing  values 

Using the rough approximations (1), (2) and (3), (4), it is possible to induce a
generalized description of the information contained in the decision table in terms of
decision rules. These are logical statements (implications) of the type "if ..., then ...",
where the antecedent (condition part) is a conjunction of elementary conditions
concerning particular condition attributes and the consequence (decision part) is a
disjunction of possible assignments to particular classes of a partition of $U$ induced
by decision attributes. Given a partition $Cl = \{Cl_t, t \in T\}$, $T = \{1, \ldots, n\}$, of $U$, the syntax
of a rule is the following:

"if $f(x,q_1) = r_{q_1}$ and $f(x,q_2) = r_{q_2}$ and ... $f(x,q_p) = r_{q_p}$, then $x$ is assigned to $Cl_{t_1}$ or
$Cl_{t_2}$ or ... $Cl_{t_k}$",

where $\{q_1, q_2, \ldots, q_p\} \subseteq C$, $(r_{q_1}, r_{q_2}, \ldots, r_{q_p}) \in V_{q_1} \times V_{q_2} \times \ldots \times V_{q_p}$ and
$\{Cl_{t_1}, Cl_{t_2}, \ldots, Cl_{t_k}\} \subseteq \{Cl_1, Cl_2, \ldots, Cl_n\}$. If the consequence is univocal, i.e. $k = 1$,
then the rule is exact, otherwise it is approximate or uncertain.

Let us observe that for any $Cl_t \in \{Cl_1, Cl_2, \ldots, Cl_n\}$ and $P \subseteq Q$, the definition (1) of the
P-lower approximation of $Cl_t$ can be rewritten as:

$\underline{I}_P(Cl_t) = \{x \in U_P: \text{for each } y \in U, \text{ if } y I_P x, \text{ then } y \in Cl_t\}$.  (1')

Thus the objects belonging to the lower approximation $\underline{I}_P(Cl_t)$ can be considered as
a basis for induction of exact decision rules.

Therefore, the statement "if $f(x,q_1) = r_{q_1}$ and $f(x,q_2) = r_{q_2}$ and ... $f(x,q_p) = r_{q_p}$, then
$x$ is assigned to $Cl_t$" is accepted as an exact decision rule iff there exists at least one
$y \in \underline{I}^*_P(Cl_t)$, $P = \{q_1, \ldots, q_p\}$, such that $f(y,q_1) = r_{q_1}$ and $f(y,q_2) = r_{q_2}$ and ... $f(y,q_p) = r_{q_p}$.
Given $\{Cl_{t_1}, \ldots, Cl_{t_k}\} \subseteq \{Cl_1, Cl_2, \ldots, Cl_n\}$ we can write:

$Bd_P(\{Cl_{t_1}, \ldots, Cl_{t_k}\}) = \{x \in U_P: \text{for each } y \in U, \text{ if } y I_P x, \text{ then } y \in Cl_{t_1} \text{ or } \ldots\ Cl_{t_k}\}$.  (2')

Thus, the objects belonging to the boundary $Bd_P(\{Cl_{t_1}, \ldots, Cl_{t_k}\})$ can be considered
as a basis for induction of approximate decision rules.

Since  each  decision  rule  is  an  implication,  a  minimal  decision  rule  represents  such 
an  implication  that  there  is  no  other  implication  with  an  antecedent  of  at  least  the 
same  weakness  and  a  consequent  of  at  least  the  same  strength. 

We say that $y \in U$ supports the exact decision rule if [$f(y,q_1) = r_{q_1}$ and/or $f(y,q_1) = *$]
and [$f(y,q_2) = r_{q_2}$ and/or $f(y,q_2) = *$] ... and [$f(y,q_p) = r_{q_p}$ and/or $f(y,q_p) = *$] and $y \in Cl_t$.
Similarly, we say that $y \in U$ supports the approximate decision rule if [$f(y,q_1) = r_{q_1}$
and/or $f(y,q_1) = *$] and [$f(y,q_2) = r_{q_2}$ and/or $f(y,q_2) = *$] ... and [$f(y,q_p) = r_{q_p}$ and/or
$f(y,q_p) = *$] and $y \in Bd_C(\{Cl_{t_1}, \ldots, Cl_{t_k}\})$.

A set of decision rules is complete if it fulfils the following conditions:

-  each $x$ belonging to the C-lower approximation of $Cl_t$ supports at least one exact
decision rule suggesting an assignment to $Cl_t$, for each $Cl_t \in Cl$,

-  each $x \in Bd_C(\{Cl_{t_1}, \ldots, Cl_{t_k}\})$ supports at least one approximate decision rule
suggesting an assignment to $Cl_{t_1}$ or $Cl_{t_2}$ or ... $Cl_{t_k}$, for each
$\{Cl_{t_1}, Cl_{t_2}, \ldots, Cl_{t_k}\} \subseteq \{Cl_1, Cl_2, \ldots, Cl_n\}$.

We call minimal each set of minimal decision rules that is complete and
non-redundant, i.e. exclusion of any rule from this set makes it non-complete.


2.3  Multi-criteria  sorting  problem  with  missing  values 

Formally, for each $q \in C$ being a criterion there exists an outranking relation [8] $S_q$ on
the set of objects $U$ such that $x S_q y$ means "x is at least as good as y with respect to
criterion q". We suppose that $S_q$ is a total preorder, i.e. a strongly complete and
transitive binary relation defined on $U$ on the basis of evaluations $f(\cdot,q)$. Precisely, we
assume that $x S_q y$ iff $f(x,q) \geq f(y,q)$.

Also in this case, we consider a directional comparison of subject $y$ to
referent $x$, for any two objects $x, y \in U$. We say that subject $y$ dominates referent $x$
with respect to criteria $P \subseteq C$ (denoted $y D^+_P x$) if for every criterion $q \in P$ the
following conditions are met:

-  $f(x,q) \neq *$,

-  $f(y,q) \geq f(x,q)$ or $f(y,q) = *$.

We say that subject $y$ is dominated by referent $x$ with respect to criteria $P \subseteq C$
(denoted $x D^-_P y$) if for every criterion $q \in P$ the following conditions are met:

-  $f(x,q) \neq *$,

-  $f(x,q) \geq f(y,q)$ or $f(y,q) = *$.

The above means that the referent object considered for dominance $D^+_P$ and $D^-_P$
should have no missing values on criteria from set $P$.
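
The two directional dominance checks can be sketched as follows. The encoding is an assumption made for illustration: evaluations are numbers on gain-type criteria (larger is better), a missing value is represented by `None` so that known values can be compared numerically, and the function names are invented here.

```python
MISSING = None


def dominates(table, y, x, criteria):
    """Subject y dominates referent x: x is complete on `criteria`, and on
    each criterion y is at least as good as x or y is missing."""
    return all(
        table[x][q] is not MISSING
        and (table[y][q] is MISSING or table[y][q] >= table[x][q])
        for q in criteria
    )


def is_dominated_by(table, y, x, criteria):
    """Subject y is dominated by referent x: x is complete on `criteria`, and
    on each criterion x is at least as good as y or y is missing."""
    return all(
        table[x][q] is not MISSING
        and (table[y][q] is MISSING or table[x][q] >= table[y][q])
        for q in criteria
    )


# Hypothetical toy evaluations on two gain criteria.
table = {
    "a": {"q1": 3, "q2": 3},
    "b": {"q1": 2, "q2": MISSING},
    "c": {"q1": 2, "q2": 2},
}
P = ["q1", "q2"]
b_dominates_c = dominates(table, "b", "c", P)
c_dominates_b = dominates(table, "c", "b", P)
b_dominated_by_a = is_dominated_by(table, "b", "a", P)
```

As with indiscernibility, an object with a missing value can act as a subject but not as a referent, so `b` can be compared to `c` but not the other way round.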

The binary relations $D^+_P$ and $D^-_P$ are not necessarily reflexive because for some
$x \in U$ there may exist $q \in P$ for which $f(x,q) = *$ and, therefore, we can state neither
$x D^+_P x$ nor $x D^-_P x$. However, $D^+_P$ and $D^-_P$ are transitive because for each $x, y, z \in U$,
(i) $x D^+_P y$ and $y D^+_P z$ imply $x D^+_P z$, and (ii) $x D^-_P y$ and $y D^-_P z$ imply $x D^-_P z$.
Implication (i) is justified by the observation that object $z$ can substitute object $y$ in
the statement $x D^+_P y$ because $y D^+_P z$ and both $y$ and $z$, as referent objects, have no
missing values. As to implication (ii), object $x$ can substitute object $y$ in the statement
$y D^-_P z$ because $x D^-_P y$ and both $x$ and $y$, as referent objects, have no missing values.

For each $P \subseteq C$ we restore the definition of set $U_P$ from sub-section 2.1. It is easy to
see that the restrictions of $D^+_P$ and $D^-_P$ to $U_P$ (in other words, the binary relations
$D^+_P \cap (U_P \times U_P)$ and $D^-_P \cap (U_P \times U_P)$ defined on $U_P$) are reflexive and transitive, i.e. they are
partial preorders.

The sets to be approximated are called upward union and downward union of
preference-ordered classes, respectively:

$Cl_t^{\geq} = \bigcup_{s \geq t} Cl_s$,  $Cl_t^{\leq} = \bigcup_{s \leq t} Cl_s$,  $t = 1, \ldots, n$.

The statement $x \in Cl_t^{\geq}$ means "x belongs at least to class $Cl_t$", while
$x \in Cl_t^{\leq}$ means "x belongs at most to class $Cl_t$".

Let us remark that $Cl_1^{\geq} = Cl_n^{\leq} = U$, $Cl_n^{\geq} = Cl_n$ and $Cl_1^{\leq} = Cl_1$. Furthermore, for
$t = 2, \ldots, n$, we have $Cl_{t-1}^{\leq} = U - Cl_t^{\geq}$ and $Cl_t^{\geq} = U - Cl_{t-1}^{\leq}$.

Given $P \subseteq C$ and $x \in U$, the "granules of knowledge" used for approximation are:

-  a set of objects dominating $x$, called the P-dominating set, $D^+_P(x) = \{y \in U: y D^+_P x\}$,

-  a set of objects dominated by $x$, called the P-dominated set, $D^-_P(x) = \{y \in U: x D^-_P y\}$.

For any $P \subseteq C$ we say that $x \in U$ belongs to $Cl_t^{\geq}$ without any ambiguity if $x \in Cl_t^{\geq}$
and for all the objects $y \in U$ dominating $x$ with respect to $P$, we have $y \in Cl_t^{\geq}$, i.e.
$D^+_P(x) \subseteq Cl_t^{\geq}$. Furthermore, we say that $x \in U$ could belong to $Cl_t^{\geq}$ if there
exists at least one object $y \in Cl_t^{\geq}$ dominated by $x$ with respect to $P$, i.e. $y \in D^-_P(x)$.

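Thus, with respect to $P \subseteq C$, the set of all objects belonging to $Cl_t^{\geq}$ without any
ambiguity constitutes the P-lower approximation of $Cl_t^{\geq}$, denoted by $\underline{P}(Cl_t^{\geq})$, and the
set of all objects that could belong to $Cl_t^{\geq}$ constitutes the P-upper approximation of
$Cl_t^{\geq}$, denoted by $\overline{P}(Cl_t^{\geq})$, for $t = 1, \ldots, n$:

$\underline{P}(Cl_t^{\geq}) = \{x \in U_P: D^+_P(x) \subseteq Cl_t^{\geq}\}$,  (5.1)

$\overline{P}(Cl_t^{\geq}) = \{x \in U_P: D^-_P(x) \cap Cl_t^{\geq} \neq \emptyset\}$.  (5.2)

Analogously, one can define the P-lower approximation and the P-upper approximation
of $Cl_t^{\leq}$, for $t = 1, \ldots, n$:

$\underline{P}(Cl_t^{\leq}) = \{x \in U_P: D^-_P(x) \subseteq Cl_t^{\leq}\}$,  (6.1)

$\overline{P}(Cl_t^{\leq}) = \{x \in U_P: D^+_P(x) \cap Cl_t^{\leq} \neq \emptyset\}$.  (6.2)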
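
The dominance-based approximations can be illustrated on a small sorting problem. Everything in the sketch below is an assumption made for illustration: five hypothetical objects evaluated on two gain criteria, classes numbered 1 to 3, `None` for a missing value, and invented helper names. Object `e` is at least as good as `c` on both criteria but is assigned to a worse class, which creates an inconsistency.

```python
MISSING = None

# Hypothetical evaluations on two gain criteria and class assignments.
TABLE = {"a": (3, 3), "b": (2, MISSING), "c": (2, 2), "d": (1, 1), "e": (3, 2)}
CLASS = {"a": 3, "b": 2, "c": 3, "d": 1, "e": 2}
P = (0, 1)  # indices of the criteria used


def complete(x):
    """True if x has no missing value on P, i.e. x may act as a referent."""
    return all(TABLE[x][q] is not MISSING for q in P)


def dominating_set(x):
    """Subjects y that dominate the complete referent x."""
    return {y for y in TABLE
            if all(TABLE[y][q] is MISSING or TABLE[y][q] >= TABLE[x][q] for q in P)}


def dominated_set(x):
    """Subjects y that are dominated by the complete referent x."""
    return {y for y in TABLE
            if all(TABLE[y][q] is MISSING or TABLE[x][q] >= TABLE[y][q] for q in P)}


def upward(t):
    return {x for x in TABLE if CLASS[x] >= t}


def downward(t):
    return {x for x in TABLE if CLASS[x] <= t}


def lower_upward(t):
    return {x for x in TABLE if complete(x) and dominating_set(x) <= upward(t)}


def upper_upward(t):
    return {x for x in TABLE if complete(x) and dominated_set(x) & upward(t)}


def upper_downward(t):
    return {x for x in TABLE if complete(x) and dominating_set(x) & downward(t)}


U_P = {x for x in TABLE if complete(x)}
low3, up3 = lower_upward(3), upper_upward(3)
boundary3 = up3 - low3
```

Only `a` belongs to the upward union of class 3 without ambiguity; `c` and `e` fall into the boundary because of the inconsistency between them, and `b` is left out of both approximations because it has a missing value.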


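Let $(Cl_t^{\geq})_P = Cl_t^{\geq} \cap U_P$ and $(Cl_t^{\leq})_P = Cl_t^{\leq} \cap U_P$, $t = 1, \ldots, n$. For each $Cl_t^{\geq}$ and $Cl_t^{\leq}$,
$t = 1, \ldots, n$, and for each $P \subseteq C$: $\underline{P}(Cl_t^{\geq}) \subseteq (Cl_t^{\geq})_P \subseteq \overline{P}(Cl_t^{\geq})$,
$\underline{P}(Cl_t^{\leq}) \subseteq (Cl_t^{\leq})_P \subseteq \overline{P}(Cl_t^{\leq})$ (rough inclusion).
Moreover, for each $Cl_t^{\geq}$, $t = 2, \ldots, n$, and $Cl_t^{\leq}$, $t = 1, \ldots, n-1$, and for
each $P \subseteq C$: $\underline{P}(Cl_t^{\geq}) = U_P - \overline{P}(Cl_{t-1}^{\leq})$, $\underline{P}(Cl_t^{\leq}) = U_P - \overline{P}(Cl_{t+1}^{\geq})$ (complementarity).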

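The P-boundaries (P-doubtful regions) of $Cl_t^{\geq}$ and $Cl_t^{\leq}$ are defined as:

$Bn_P(Cl_t^{\geq}) = \overline{P}(Cl_t^{\geq}) - \underline{P}(Cl_t^{\geq})$,  $Bn_P(Cl_t^{\leq}) = \overline{P}(Cl_t^{\leq}) - \underline{P}(Cl_t^{\leq})$,  for $t = 1, \ldots, n$.

Due to complementarity of the rough approximations [1], the following property
holds: $Bn_P(Cl_t^{\geq}) = Bn_P(Cl_{t-1}^{\leq})$, for $t = 2, \ldots, n$, and $Bn_P(Cl_t^{\leq}) = Bn_P(Cl_{t+1}^{\geq})$, for $t = 1, \ldots, n-1$.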

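To preserve the monotonicity property of the lower approximation (see
sub-section 2.1) it is necessary to use another definition of the approximation for a given
$Cl_t^{\geq}$ and $Cl_t^{\leq}$, $t = 1, \ldots, n$, and for each $P \subseteq C$:

$\underline{P}(Cl_t^{\geq})^* = \bigcup_{R \subseteq P} \underline{R}(Cl_t^{\geq})$,  (7.1)

$\underline{P}(Cl_t^{\leq})^* = \bigcup_{R \subseteq P} \underline{R}(Cl_t^{\leq})$.  (7.2)

$\underline{P}(Cl_t^{\geq})^*$ and $\underline{P}(Cl_t^{\leq})^*$ will be called cumulative P-lower approximations of unions
$Cl_t^{\geq}$ and $Cl_t^{\leq}$, $t = 1, \ldots, n$, because they include all the objects belonging to all R-lower
approximations of $Cl_t^{\geq}$ and $Cl_t^{\leq}$, respectively, where $R \subseteq P$.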

It can be shown that another type of dominance relation, denoted by $D^*_P$, permits a
direct definition of the cumulative P-lower approximations in a classical way. For
each $x, y \in U$ and for each $P \subseteq Q$, $y D^*_P x$ means that $f(y,q) \geq f(x,q)$ or $f(x,q) = *$ and/or
$f(y,q) = *$, for every $q \in P$.

Given $P \subseteq C$ and $x \in U$, the "granules of knowledge" used for approximation are:

-  a set of objects dominating $x$, called the P-dominating set, $D^{+*}_P(x) = \{y \in U: y D^*_P x\}$,

-  a set of objects dominated by $x$, called the P-dominated set, $D^{-*}_P(x) = \{y \in U: x D^*_P y\}$.

$D^*_P$ is reflexive but not transitive. Let us observe that the restriction of $D^*_P$ to $U^*_P$
is reflexive and transitive when $U^*_P = \{x \in U: f(x,q) \neq * \text{ for at least one } q \in P\}$.

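Theorem 3. (Definitions (7.1) and (7.2) expressed in terms of $D_P^*$)

$\underline{P}(Cl_t^{\ge})^* = \{x \in U_P^* : D_P^{+*}(x) \subseteq Cl_t^{\ge}\}$, $\underline{P}(Cl_t^{\le})^* = \{x \in U_P^* : D_P^{-*}(x) \subseteq Cl_t^{\le}\}$.

Using $D_P^*$ we can give a definition of the P-upper approximations of $Cl_t^{\ge}$ and $Cl_t^{\le}$,
complementary to $\underline{P}(Cl_t^{\ge})^*$ and $\underline{P}(Cl_t^{\le})^*$, respectively:

$\overline{P}(Cl_t^{\ge})^* = \{x \in U_P^* : D_P^{-*}(x) \cap Cl_t^{\ge} \neq \emptyset\}$,   (8.1)

$\overline{P}(Cl_t^{\le})^* = \{x \in U_P^* : D_P^{+*}(x) \cap Cl_t^{\le} \neq \emptyset\}$.   (8.2)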
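
The dominance relation with missing values and the approximations of Theorem 3 and definition (8.1) can be illustrated with a small Python sketch. It is only an illustration under stated assumptions: the names `MISSING`, `dominates` and `cumulative_approximations`, the toy decision table and the encoding of `*` as `None` are hypothetical and do not come from the paper.

```python
from typing import Dict, Iterable, Optional, Set, Tuple

MISSING = None  # assumed encoding of the '*' symbol in the decision table

Evaluation = Dict[str, Optional[float]]


def dominates(fy: Evaluation, fx: Evaluation, P: Iterable[str]) -> bool:
    """y D*_P x: for every q in P, f(y,q) >= f(x,q) or at least one value is missing."""
    return all(
        fy[q] is MISSING or fx[q] is MISSING or fy[q] >= fx[q] for q in P
    )


def cumulative_approximations(
    table: Dict[str, Evaluation], upward_union: Set[str], P: Tuple[str, ...]
) -> Tuple[Set[str], Set[str]]:
    """Return the P-lower and P-upper approximations of an upward union of classes."""
    # U*_P: objects with a known value on at least one criterion of P.
    u_star = {x for x, f in table.items() if any(f[q] is not MISSING for q in P)}
    lower: Set[str] = set()
    upper: Set[str] = set()
    for x in u_star:
        dominating = {y for y in table if dominates(table[y], table[x], P)}
        dominated = {y for y in table if dominates(table[x], table[y], P)}
        if dominating <= upward_union:  # Theorem 3: P-dominating set inside the union
            lower.add(x)
        if dominated & upward_union:    # (8.1): P-dominated set meets the union
            upper.add(x)
    return lower, upper


# Hypothetical table: objects a, b are in class 2, objects c, d in class 1.
table = {
    "a": {"q1": 3, "q2": 3},
    "b": {"q1": 2, "q2": MISSING},
    "c": {"q1": 1, "q2": 1},
    "d": {"q1": MISSING, "q2": 2},
}
lower, upper = cumulative_approximations(table, {"a", "b"}, ("q1", "q2"))
```

In this toy table object b cannot be placed in the lower approximation, because d dominates b once the missing values are taken into account, so b and d end up in the boundary.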

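For each $Cl_t^{\ge} \subseteq U$ and $Cl_t^{\le} \subseteq U$, let $(Cl_t^{\ge})^* = Cl_t^{\ge} \cap U_P^*$ and $(Cl_t^{\le})^* = Cl_t^{\le} \cap U_P^*$.
Let us remark that $x \in U_P^*$ if and only if there exists $R \neq \emptyset$ such that $R \subseteq P$ and $x \in U_R$.
For each $Cl_t^{\ge}$ and $Cl_t^{\le}$, $t=1,\ldots,n$, and for each $P \subseteq C$: $\underline{P}(Cl_t^{\ge})^* \subseteq (Cl_t^{\ge})^* \subseteq \overline{P}(Cl_t^{\ge})^*$,
$\underline{P}(Cl_t^{\le})^* \subseteq (Cl_t^{\le})^* \subseteq \overline{P}(Cl_t^{\le})^*$ (rough inclusion). Moreover, for each $Cl_t^{\ge}$, $t=2,\ldots,n$,
and $Cl_t^{\le}$, $t=1,\ldots,n-1$, and for each $P \subseteq C$: $\underline{P}(Cl_t^{\ge})^* = U_P^* - \overline{P}(Cl_{t-1}^{\le})^*$,
$\underline{P}(Cl_t^{\le})^* = U_P^* - \overline{P}(Cl_{t+1}^{\ge})^*$ (complementarity).

The P-boundaries of $Cl_t^{\ge}$ and $Cl_t^{\le}$, $t=1,\ldots,n$, approximated with $D_P^*$ are equal,
respectively, to $Bn_P^*(Cl_t^{\ge}) = \overline{P}(Cl_t^{\ge})^* - \underline{P}(Cl_t^{\ge})^*$, $Bn_P^*(Cl_t^{\le}) = \overline{P}(Cl_t^{\le})^* - \underline{P}(Cl_t^{\le})^*$.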

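Theorem 4. (Monotonicity of the accuracy of approximation) For each $Cl_t^{\ge}$ and
$Cl_t^{\le}$, $t=1,\ldots,n$, and for each $P, R \subseteq C$ such that $P \subseteq R$, the following inclusions hold:
$\underline{P}(Cl_t^{\ge})^* \subseteq \underline{R}(Cl_t^{\ge})^*$, $\underline{P}(Cl_t^{\le})^* \subseteq \underline{R}(Cl_t^{\le})^*$.

Furthermore, if $U_P^* = U_R^*$, the following inclusions are also true:

$\overline{P}(Cl_t^{\ge})^* \supseteq \overline{R}(Cl_t^{\ge})^*$, $\overline{P}(Cl_t^{\le})^* \supseteq \overline{R}(Cl_t^{\le})^*$.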

Due to Theorem 4, when augmenting a set of attributes P, we get lower
approximations of $Cl_t^{\ge}$ and $Cl_t^{\le}$, $t=1,\ldots,n$, that are at least of the same cardinality.
Thus, we can restore for the case of missing values the key concepts of rough set
theory: accuracy and quality of approximation, reduct and core.

For every $P \subseteq C$ we define the quality of approximation of
partition Cl by set of attributes P, or in short, quality of sorting, denoted by $\gamma_P(Cl)$.
The quality expresses the ratio of all P-correctly sorted objects to all objects in the
decision table.

Each minimal subset $P \subseteq C$ such that $\gamma_P(Cl) = \gamma_C(Cl)$ is called a reduct of Cl and
denoted by $RED_{Cl}(C)$. Let us remark that a decision table can have more than one
reduct. The intersection of all reducts is called the core and denoted by $CORE_{Cl}(C)$.


2.4  Decision  rules  for  multi-criteria  sorting  with  missing  values 

Using the rough approximations (5.1), (5.2), (6.1), (6.2) and (7.1), (7.2), (8.1),
(8.2), it is possible to induce a generalized description of the information contained in
the decision table in terms of "if ..., then ..." decision rules.

Given the preference-ordered classes of partition $Cl = \{Cl_t, t \in T\}$, $T = \{1,\ldots,n\}$, of
U, the following three types of decision rules can be considered:

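1)  $D_{\ge}$-decision rules with the following syntax:

"if $f(x,q_1) \ge r_{q_1}$ and $f(x,q_2) \ge r_{q_2}$ and ... $f(x,q_p) \ge r_{q_p}$, then $x \in Cl_t^{\ge}$",

where $P = \{q_1,\ldots,q_p\} \subseteq C$, $(r_{q_1},\ldots,r_{q_p}) \in V_{q_1} \times V_{q_2} \times \ldots \times V_{q_p}$ and $t \in T$;

2)  $D_{\le}$-decision rules with the following syntax:

"if $f(x,q_1) \le r_{q_1}$ and $f(x,q_2) \le r_{q_2}$ and ... $f(x,q_p) \le r_{q_p}$, then $x \in Cl_t^{\le}$",

where $P = \{q_1,\ldots,q_p\} \subseteq C$, $(r_{q_1},\ldots,r_{q_p}) \in V_{q_1} \times V_{q_2} \times \ldots \times V_{q_p}$ and $t \in T$;

3)  $D_{\ge\le}$-decision rules with the following syntax:

"if $f(x,q_1) \ge r_{q_1}$ and $f(x,q_2) \ge r_{q_2}$ and ... $f(x,q_k) \ge r_{q_k}$ and $f(x,q_{k+1}) \le r_{q_{k+1}}$ and ...
$f(x,q_p) \le r_{q_p}$, then $x \in Cl_s \cup Cl_{s+1} \cup \ldots \cup Cl_t$",

where $O' = \{q_1,\ldots,q_k\} \subseteq C$, $O'' = \{q_{k+1},\ldots,q_p\} \subseteq C$, $P = O' \cup O''$, $O'$ and $O''$ not
necessarily disjoint, $(r_{q_1},\ldots,r_{q_p}) \in V_{q_1} \times V_{q_2} \times \ldots \times V_{q_p}$, $s, t \in T$ such that $s < t$.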

As it is possible that $\{q_1,\ldots,q_k\} \cap \{q_{k+1},\ldots,q_p\} \neq \emptyset$, in the condition part of a $D_{\ge\le}$-decision
rule we can have "$f(x,q) \ge r_q$" and "$f(x,q) \le r'_q$", where $r_q \le r'_q$, for some $q \in C$.
Moreover, if $r_q = r'_q$, the two conditions boil down to "$f(x,q) = r_q$".

Since  each  decision  rule  is  an  implication,  by  a  minimal  decision  rule  we 
understand  such  an  implication  that  there  is  no  other  implication  with  an  antecedent 
of  at  least  the  same  weakness  and  a  consequent  of  at  least  the  same  strength. 

We  say  that  an  object  supports  a  rule  if  its  evaluation  by  set  C  of  criteria  matches 
the  condition  part  of  the  rule. 
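
As a hedged illustration of rule support, the sketch below encodes the condition part of a rule as a list of elementary conditions and checks whether an object's evaluation matches it. The representation, the names `supports` and `OPS`, and in particular the `missing_matches` switch are assumptions made for this example; the text above does not fix how a missing value is matched against a condition.

```python
import operator
from typing import Dict, List, Optional, Tuple

OPS = {">=": operator.ge, "<=": operator.le}

Condition = Tuple[str, str, float]  # (criterion, ">=" or "<=", threshold r_q)


def supports(
    evaluation: Dict[str, Optional[float]],
    conditions: List[Condition],
    missing_matches: bool = False,
) -> bool:
    """True if the evaluation matches every elementary condition of the rule.

    missing_matches is a hypothetical switch: by default a missing value
    (None) is conservatively treated as not matching.
    """
    for criterion, op, threshold in conditions:
        value = evaluation.get(criterion)
        if value is None:
            if not missing_matches:
                return False
            continue
        if not OPS[op](value, threshold):
            return False
    return True


# Condition part mixing lower and upper bounds on criteria q1 and q2.
rule = [("q1", ">=", 2), ("q2", "<=", 3)]
# The same criterion bounded from both sides with equal thresholds: f(x,q) = 2.
equality_rule = [("q", ">=", 2), ("q", "<=", 2)]
```

The second rule shows the remark made above: two conditions on the same criterion with equal thresholds boil down to an equality test.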

A  set  of  decision  rules  is  complete  if  it  fulfils  the  following  conditions: 

-  each $y \in \underline{C}(Cl_t^{\ge})^*$ supports at least one $D_{\ge}$-decision rule of the type "if $f(x,q_1) \ge r_{q_1}$
and $f(x,q_2) \ge r_{q_2}$ and ... $f(x,q_p) \ge r_{q_p}$, then $x \in Cl_r^{\ge}$", with $r, t \in \{2,\ldots,n\}$ and $r \ge t$,

-  each $y \in \underline{C}(Cl_t^{\le})^*$ supports at least one $D_{\le}$-decision rule of the type "if $f(x,q_1) \le r_{q_1}$
and $f(x,q_2) \le r_{q_2}$ and ... $f(x,q_p) \le r_{q_p}$, then $x \in Cl_u^{\le}$", with $u, t \in \{1,\ldots,n-1\}$ and $u \le t$,

-  each $y \in \overline{C}(Cl_s^{\le})^* \cap \overline{C}(Cl_t^{\ge})^*$ supports at least one $D_{\ge\le}$-decision rule of the type
"if $f(x,q_1) \ge r_{q_1}$ and $f(x,q_2) \ge r_{q_2}$ and ... $f(x,q_k) \ge r_{q_k}$ and $f(x,q_{k+1}) \le r_{q_{k+1}}$ and ...
$f(x,q_p) \le r_{q_p}$, then $x \in Cl_v \cup Cl_{v+1} \cup \ldots \cup Cl_z$", with $s, t, v, z \in T$ and $v \le s < t \le z$.

Let us remark that application of any complete set of decision rules on the objects
from the data table results in either exact or approximate reassignment of these
objects to the classes $Cl_t$, $t \in T$. Let us explain this reassignment in more detail.

Given a complete set of rules, and an object $y \in U$, such that $y \notin Bn_C^*(Cl_s^{\ge})$ and
$y \notin Bn_C^*(Cl_s^{\le})$ for any $s \in T$, the following situations may occur:

-  $y \in Cl_t$, $t=2,\ldots,n-1$; then there exists at least one $D_{\ge}$-decision rule with consequent
$x \in Cl_t^{\ge}$, and at least one $D_{\le}$-decision rule with consequent $x \in Cl_t^{\le}$;

-  $y \in Cl_1$; then there exists at least one $D_{\le}$-decision rule with consequent $x \in Cl_1^{\le}$;

-  $y \in Cl_n$; then there exists at least one $D_{\ge}$-decision rule with consequent $x \in Cl_n^{\ge}$.

In all the above situations, intersection of all unions (upward and downward) of
classes suggested for assignment in the consequent of rules matching object y will
result in (exact) reassignment of y to class $Cl_t$, $t \in T$.
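
The intersection of suggested unions can be sketched as follows. This is a minimal illustration, assuming that classes are numbered 1 to n and that each matching rule contributes a consequent encoded as a hypothetical pair `(">=", t)` for an upward union or `("<=", t)` for a downward union; the function names are not taken from the paper.

```python
from typing import Iterable, Set, Tuple


def union_of_classes(kind: str, t: int, n: int) -> Set[int]:
    """Indices of the classes in an upward (">=") or downward ("<=") union."""
    if kind == ">=":
        return set(range(t, n + 1))
    return set(range(1, t + 1))


def reassign(consequents: Iterable[Tuple[str, int]], n: int) -> Set[int]:
    """Intersect all unions suggested by the rules matching an object."""
    result = set(range(1, n + 1))
    for kind, t in consequents:
        result &= union_of_classes(kind, t, n)
    return result


# An object of class 2 (n = 4) matched by rules "at least 2", "at most 2"
# and a weaker rule "at most 3" is reassigned exactly to class 2.
suggested = reassign([(">=", 2), ("<=", 2), ("<=", 3)], n=4)
```

When the matching rules only narrow the object down to several consecutive classes, the same intersection returns that set of classes, which corresponds to the approximate reassignment.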

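Similarly, for each object $y \in \overline{C}(Cl_s^{\le})^* \cap \overline{C}(Cl_t^{\ge})^*$, $s < t$, such that
$y \notin \overline{C}(Cl_{s_1}^{\le})^* \cap \overline{C}(Cl_{t_1}^{\ge})^*$ for $s_1 < s$ and $t \le t_1$, or $s_1 \le s$ and $t < t_1$,
which means that y belongs exclusively to boundaries
$Bn_C^*(Cl_v^{\ge})$, $v=s+1,\ldots,t$, and $Bn_C^*(Cl_z^{\le})$, $z=s,\ldots,t-1$, there exists at least one
$D_{\ge\le}$-decision rule whose consequent is $x \in Cl_s \cup Cl_{s+1} \cup \ldots \cup Cl_t$. Thus, as a result of
application of the complete set of rules to object y, it will be reassigned
(approximately) to classes $Cl_s \cup Cl_{s+1} \cup \ldots \cup Cl_t$.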

We  call  minimal  each  set  of  minimal  decision  rules  that  is  complete  and  non- 
redundant,  i.e.  exclusion  of  any  rule  from  this  set  makes  it  non-complete. 


3  Conclusions 

We  adapted  the  rough  sets  methodology  to  the  analysis  of  data  sets  with  missing 
values.  The  adaptation  concerns  both  the  classical  rough  set  approach  based  on  the 
use  of  indiscernibility  relations  and  the  new  rough  set  approach  based  on  the  use  of 
dominance  relations.  While  the  first  approach  deals  with  multi-attribute  classification 
problems,  the  second  approach  deals  with  multi-criteria  sorting  problems.  The  two 
adapted  rough  set  approaches  maintain  all  good  characteristics  of  their  original 
approaches.  They  also  boil  down  to  the  original  approaches  when  there  are  no 
missing  values. 

The  case  of  missing  values  is  very  often  met  in  practice  and  not  many  methods 
can  deal  satisfactorily  with  this  problem.  The  way  of  handling  the  missing  values  in 
our  approach  seems  faithful  with  respect  to  available  data  because  the  decision  rules 
are  robust  in  the  sense  of  being  founded  on  objects  existing  in  the  data  set  and  not  on 
hypothetical objects created by putting some possible values instead of the missing
ones. This is a distinctive feature of our extension in comparison with the extension
proposed by Kryszkiewicz [4].

Abstract. The gRS-ILP model (generic Rough Set Inductive Logic
Programming model) provides a framework for Inductive Logic Programming
when the setting is imprecise and any induced logic program will
not be able to distinguish between certain positive and negative examples.
However, in this rough setting, where it is inherently not possible to
describe the entire data with 100% accuracy, it is possible to definitively
describe part of the data with 100% accuracy. The gRS-ILP model is
extended in this paper to motifs in strings. An illustrative experiment
is presented using the ILP system Progol on transmembrane domains in
amino acid sequences.

Keywords: Rough Set Theory; Inductive Logic Programming; Machine Learning;
Knowledge Discovery from Data; Molecular biology

1  Introduction 

Inductive  Logic  Programming  [1]  is  the  research  area  formed  at  the  intersection 
of  logic  programming  and  machine  learning.  Inductive  Logic  Programming  (in 
the  example  setting)  uses  background  knowledge  definite  clauses,  and  positive 
and  negative  example  ground  facts  to  induce  a  logic  program  that  describes  the 
examples,  where  the  induced  logic  program  consists  of  the  original  background 
knowledge  along  with  an  induced  hypothesis  (as  definite  clauses). 

Rough set theory [2,3] defines an indiscernibility relation, where certain subsets
of examples cannot be distinguished. A concept is rough when it contains at
least  one  such  indistinguishable  subset  that  contains  both  positive  and  negative 
examples.  It  is  inherently  not  possible  to  describe  the  examples  accurately,  since 
certain  positive  and  negative  examples  cannot  be  distinguished. 

The gRS-ILP model [4] introduces a rough setting in Inductive Logic Programming
and describes the situation where the background knowledge, declarative
bias and evidence are such that it is not possible to induce any logic program
from them that is able to distinguish between certain positive and negative examples.
Any induced logic program will either cover both the positive and the
negative examples in the group, or not cover the group at all, with both the
positive and the negative examples in this group being left out.

The gRS-ILP model has useful applications in the definitive description of
large data. Knowledge discovery in databases is the non-trivial process of identifying
valid, novel, potentially useful, and ultimately understandable patterns in
data [5]. This usually involves one of two different goals: prediction and description.
Prediction involves using some variables or fields in the database to predict
unknown or future values of other variables of interest. Description focuses on
finding human-interpretable patterns describing the data. Definitive description
is the description of the data with full accuracy. In a rough setting, it is not
possible to definitively describe the entire data, since some of the positive examples
and negative examples (of the concept being described) inherently cannot
be distinguished from each other.

Conventional  systems  handle  a  rough  setting  by  using  various  techniques  to 
induce  a  hypothesis  that  describes  the  evidence  as  well  as  possible.  They  aim  to 
maximize  the  correct  cover  of  the  induced  hypothesis  by  maximizing  the  number 
of  positive  examples  covered  and  negative  examples  not  covered.  This  means 
that  most  of  the  positive  evidence  would  be  described,  along  with  some  of  the 
negative  evidence.  The  induced  hypothesis  cannot  say  with  certainty  whether 
an  example  definitely  belongs  to  the  evidence  or  not.  However,  the  gRS-ILP 
model  lays  a  firm  theoretical  foundation  for  the  definitive  description  of  data  in 
a  rough  setting.  A  part  of  the  data  is  definitively  described.  The  rest  of  the  data 
can  then  be  described  using  conventional  methods,  but  not  definitively. 

This  paper  extends  the  gRS-ILP  model  to  motifs  in  strings,  and  reports  an 
illustrative  experiment  using  Progol  on  transmembrane  domains  in  amino  acid 
sequences. 

2  Formal  definitions  of  the  gRS-ILP  model 

The generic Rough Set Inductive Logic Programming (gRS-ILP) model introduces
the basic definition of elementary sets and a rough setting in ILP [6,4].
The  essential  feature  of  an  elementary  set  is  that  it  consists  of  examples  that 
cannot  be  distinguished  from  each  other  by  any  induced  logic  program  in  that 
ILP  system.  The  essential  feature  of  a  rough  setting  is  that  it  is  inherently  not 
possible  for  the  consistency  and  completeness  criteria  to  be  fulfilled  together, 
since  both  positive  and  negative  examples  are  in  the  same  elementary  set.  The 
basic  definitions  formalised  in  [7]  follow. 

2.1  The  RSILP  system 

We  first  formally  define  the  ILP  system  in  the  example  setting  of  [8]  as  follows. 

Definition 2.1. An ILP system in the example setting is a tuple $S_{es} = (E_{es}, B)$,
where

(1) $E_{es} = E^+ \cup E^-$ is the universe, where $E^+$ is the set of positive examples
(true ground facts), and $E^-$ is the set of negative examples (false ground facts),
and

(2) $B$ is a background knowledge given as definite clauses such that (i) for all
$e^- \in E^-$, $B \nvdash e^-$, and (ii) for some $e^+ \in E^+$, $B \nvdash e^+$. □

Let $S_{es} = (E_{es}, B)$ be an ILP system in the example setting. Then let $\mathcal{H}(S_{es})$
(also written as $\mathcal{H}(E_{es}, B)$) denote the set of all possible definite clause hypotheses
that can be induced from $E_{es}$ and $B$, and be called the hypothesis space
induced from $S_{es}$ (or from $E_{es}$ and $B$). Further, let $\mathcal{P}(S_{es})$ (also written as
$\mathcal{P}(E_{es}, B)$) $= \{P = B \wedge H \mid H \in \mathcal{H}(E_{es}, B)\}$ denote the set of all the programs
induced from $E_{es}$ and $B$, and be called the program space induced from $S_{es}$ (or
from $E_{es}$ and $B$).

Our aim is to find a program $P \in \mathcal{P}(S_{es})$ such that the next two conditions
hold: (iii) for all $e^- \in E^-$, $P \nvdash e^-$, (iv) for all $e^+ \in E^+$, $P \vdash e^+$.

The  following  definitions  of  Rough  Set  ILP  systems  in  the  gRS-ILP  model 
(abbreviated  as  RSILP  systems)  use  the  terminology  of  [8]. 

Definition 2.2. An RSILP system in the example setting (abbreviated as
RSILP-E system) is an ILP system in the example setting, $S_{es} = (E_{es}, B)$, such
that there does not exist a program $P \in \mathcal{P}(S_{es})$ satisfying both the conditions
(iii) and (iv) above. □

Definition  2.3.  An  RSILP-E  system  in  the  single-predicate  learning  context 
(abbreviated  as  RSILP-ES  system)  is  an  RSILP-E  system,  whose  universe  E  is 
such  that  all  examples  (ground  facts)  in  E  use  only  one  predicate,  also  known 
as  the  target  predicate.  □ 

A  declarative  bias  [8]  biases  or  restricts  the  set  of  acceptable  hypotheses,  and 
is  of  two  kinds:  syntactic  bias  (also  called  language  bias)  that  imposes  restrictions 
on  the  form  (syntax)  of  clauses  allowed  in  the  hypothesis,  and  semantic  bias  that 
imposes  restrictions  on  the  meaning,  or  the  behaviour  of  hypotheses. 

Definition 2.4. An RSILP-ES system with declarative bias (abbreviated as
RSILP-ESD system) is a tuple $S = (S', L)$, where

(i) $S' = (E, B)$ is an RSILP-ES system, and

(ii) $L$ is a declarative bias, which is any restriction imposed on the hypothesis
space $\mathcal{H}(E, B)$.

We also write $S = (E, B, L)$ instead of $S = (S', L)$. □

For any RSILP-ESD system $S = (E, B, L)$, let
$\mathcal{H}(S) = \{H \in \mathcal{H}(E, B) \mid H$ is allowed by $L\}$, and
$\mathcal{P}(S) = \{P = B \wedge H \mid H \in \mathcal{H}(S)\}$.

$\mathcal{H}(S)$ (also written as $\mathcal{H}(E, B, L)$) is called the hypothesis space induced from $S$
(or from $E$, $B$, and $L$). $\mathcal{P}(S)$ (also written as $\mathcal{P}(E, B, L)$) denotes the set of all
the programs induced by $S$, and is called the program space induced from $S$ (or
from $E$, $B$, and $L$).

2.2  Equivalence relation, elementary sets and composed sets

We now define an equivalence relation on the universe of an RSILP-ESD system.

Definition 2.5. Let $S = (E, B, L)$ be an RSILP-ESD system. An indiscernibility
relation of $S$, denoted by $R(S)$, is a relation on $E$ defined as follows:
$\forall x, y \in E$, $(x, y) \in R(S)$ iff
$(P \vdash x \Leftrightarrow P \vdash y)$ for any $P \in \mathcal{P}(S)$ (i.e. iff x and y are inherently indistinguishable
by any induced logic program $P$ in $\mathcal{P}(S)$). □

The following fact follows directly from the definition of $R(S)$.

Fact 1  For any RSILP-ESD system $S$, $R(S)$ is an equivalence relation.

Definition 2.6. Let $S = (E, B, L)$ be an RSILP-ESD system. An elementary
set of $R(S)$ is an equivalence class of the relation $R(S)$. For each $x \in E$, let
$[x]_{R(S)}$ denote the elementary set of $R(S)$ containing $x$. Formally,

$[x]_{R(S)} = \{y \in E \mid (x, y) \in R(S)\}$.

A composed set of $R(S)$ is any finite union of elementary sets of $R(S)$. □
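
To make Definitions 2.5 and 2.6 concrete, here is a small Python sketch. It rests on a strong simplifying assumption that is not part of the model: the program space is finite and given explicitly, with each program represented only by the set of examples it entails. The name `elementary_sets` and the toy data are hypothetical.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, Iterable, List, Set, Tuple


def elementary_sets(
    examples: Iterable[str], covers: List[Set[str]]
) -> List[FrozenSet[str]]:
    """Group examples that no program in the (finite, explicit) space can tell apart.

    covers[i] is assumed to be the set of examples entailed by the i-th program.
    Two examples are indiscernible iff they are entailed by exactly the same programs.
    """
    groups: Dict[Tuple[bool, ...], Set[str]] = defaultdict(set)
    for e in examples:
        signature = tuple(e in cover for cover in covers)
        groups[signature].add(e)
    return [frozenset(group) for group in groups.values()]


# Toy universe: two positive (p1, p2) and two negative (n1, n2) examples,
# and a program space with two programs described by their covers.
examples = ["p1", "p2", "n1", "n2"]
covers = [{"p1", "p2", "n1"}, {"p1"}]
blocks = elementary_sets(examples, covers)
```

Here p2 and n1 fall into the same elementary set, since every program either entails both or neither, which is exactly the rough setting of Definition 2.7.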

Definition 2.7. An RSILP-ESD system $S = (E, B, L)$ is said to be in a rough
setting iff

$\exists e^+ \in E^+ \; \exists e^- \in E^- \; ((e^+, e^-) \in R(S))$. □

We  now  define  a  combination  of  declarative  biases. 

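Let $S = (E, B)$ be an RSILP-ES system. Let $L_1$, $L_2$ and $L_3$ be declarative
biases. $L_1 \wedge L_2$ (resp., $L_1 \vee L_2$) denotes the declarative bias such that $\mathcal{H}(S') =
\mathcal{H}(S_1) \cap \mathcal{H}(S_2)$ (resp., $\mathcal{H}(S'') = \mathcal{H}(S_1) \cup \mathcal{H}(S_2)$), where $S' = (E, B, L_1 \wedge L_2)$,
$S'' = (E, B, L_1 \vee L_2)$, $S_1 = (E, B, L_1)$ and $S_2 = (E, B, L_2)$ are RSILP-ESD
systems.

$L_1 \wedge L_2 \wedge L_3$ (resp., $(L_1 \wedge L_2) \vee L_3$) denotes the declarative bias such that
$\mathcal{H}(S''') = \mathcal{H}(S_1) \cap \mathcal{H}(S_2) \cap \mathcal{H}(S_3)$ (resp., $\mathcal{H}(S'''') = (\mathcal{H}(S_1) \cap \mathcal{H}(S_2)) \cup \mathcal{H}(S_3)$),
where $S''' = (E, B, L_1 \wedge L_2 \wedge L_3)$, $S'''' = (E, B, (L_1 \wedge L_2) \vee L_3)$, $S_1 = (E, B, L_1)$,
$S_2 = (E, B, L_2)$ and $S_3 = (E, B, L_3)$ are RSILP-ESD systems. $L_1 \vee L_2 \vee L_3$,
$(L_1 \vee L_2) \wedge L_3$, ..., etc. are defined similarly.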

2.3  Consistency and completeness in the gRS-ILP model

Let $S = (E, B, L)$ be an RSILP-ESD system, and $\mathcal{P}(S)$ the program space
induced by $S$.

Definition 2.8. The upper approximation of $S$, $Upap(S)$, is defined as the least
composed set of $R(S)$ such that $E^+ \subseteq Upap(S)$. □

Definition 2.9. The lower approximation of $S$, $Loap(S)$, is defined as the
greatest composed set of $R(S)$ such that $Loap(S) \subseteq E^+$. □

The set $Bndr(S) = Upap(S) - Loap(S)$ is known as the boundary region of
$S$ (or the borderline region of $S$). The lower approximation of $S$, $Loap(S)$, is also
known as $Pos(S)$, the positive region of $S$. The set $Neg(S) = E - Upap(S)$ is
known as the negative region of $S$.
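
Given the elementary sets, the regions defined above reduce to a few set operations. The sketch below is a minimal illustration under the assumption that the elementary sets are already available as explicit finite sets; the function name `rough_regions` and the toy data are hypothetical.

```python
from typing import Dict, Iterable, Set


def rough_regions(
    elementary: Iterable[Set[str]], positives: Set[str]
) -> Dict[str, Set[str]]:
    """Compute Loap, Upap, Bndr and Neg from elementary sets and the positive examples."""
    universe: Set[str] = set()
    lower: Set[str] = set()
    upper: Set[str] = set()
    for block in elementary:
        universe |= block
        if block <= positives:   # block lies entirely inside E+
            lower |= block
        if block & positives:    # block shares at least one example with E+
            upper |= block
    return {
        "Loap": lower,
        "Upap": upper,
        "Bndr": upper - lower,
        "Neg": universe - upper,
    }


# Toy elementary sets: p2 and n1 are indiscernible, so the setting is rough.
regions = rough_regions([{"p1"}, {"p2", "n1"}, {"n2"}], positives={"p1", "p2"})
```

In this toy case only p1 is in the positive region and only n2 in the negative region; p2 and n1 form the boundary region.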

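Definition 2.10. The consistent program space $\mathcal{P}_{cons}(S)$ of $S$ is defined as
$\mathcal{P}_{cons}(S) = \{P \in \mathcal{P}(S) \mid P \nvdash e^-, \forall e^- \in E^-\}$.

A program $P \in \mathcal{P}(S)$ is consistent with respect to $S$ iff $P \in \mathcal{P}_{cons}(S)$.

The reverse-consistent program space $\mathcal{P}_{rev-cons}(S)$ of $S$ is defined as
$\mathcal{P}_{rev-cons}(S) = \{P \in \mathcal{P}(S) \mid P \nvdash e^+, \forall e^+ \in E^+\}$.

A program $P \in \mathcal{P}(S)$ is reverse-consistent w.r.t. $S$ iff $P \in \mathcal{P}_{rev-cons}(S)$. □

Definition 2.11. The complete program space $\mathcal{P}_{comp}(S)$ of $S$ is defined as
$\mathcal{P}_{comp}(S) = \{P \in \mathcal{P}(S) \mid P \vdash e^+, \forall e^+ \in E^+\}$.

A program $P \in \mathcal{P}(S)$ is complete with respect to $S$ iff $P \in \mathcal{P}_{comp}(S)$. □

Definition 2.12. The cover of a program $P \in \mathcal{P}(S)$ in $S$ is defined as
$cover(S, P) = \{e \in E \mid P \vdash e\}$. □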

The  following  facts  follow  directly  from  the  definitions. 

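Fact 2  $\forall P \in \mathcal{P}_{cons}(S)$, $cover(S, P) \subseteq Loap(S)$.

Fact 3  $\forall P \in \mathcal{P}_{comp}(S)$, $cover(S, P) \supseteq Upap(S)$.

Fact 4  $\forall P \in \mathcal{P}_{comp}(S)$, $(E - cover(S, P)) \subseteq (E - Upap(S))$.

Fact 5  $\forall P \in \mathcal{P}_{rev-cons}(S)$, $cover(S, P) \subseteq (E - Upap(S))$.

Fact 6  $\forall P \in \mathcal{P}_{cons}(S)$, $P \vdash e \Rightarrow e \in E^+$.

Fact 7  $\forall P \in \mathcal{P}_{comp}(S)$, $P \nvdash e \Rightarrow e \in E^-$.

Fact 8  $\forall P \in \mathcal{P}_{rev-cons}(S)$, $P \vdash e \Rightarrow e \in E^-$.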

These  facts  are  used  in  a  rough  setting  for  the  definitive  description  of  data. 
Definitive  description  involves  the  description  of  the  data  with  100%  accuracy. 
In  a  rough  setting,  it  is  not  possible  to  definitively  describe  the  entire  data, 
since  some  of  the  positive  examples  and  negative  examples  (of  the  concept  being 
described)  inherently  cannot  be  distinguished  from  each  other.  These  facts  show 
that  definitive  description  is  possible  in  a  rough  setting  when  an  example  is 
covered  by  a  consistent  program  (the  example  is  then  definitely  positive),  covered 
by  a  reverse-consistent  program  (the  example  is  then  definitely  negative),  or  is 
not  covered  by  a  complete  program  (the  example  is  then  definitely  negative). 

2.4  Some  useful  declarative  biases 

Let $L_{pi}$ be the declarative bias such that for any RSILP-ESD system $S =
(E, B, L_{pi})$, $H \in \mathcal{H}(S) \Rightarrow$ head predicate of $C$ is the target predicate, for any
$C \in H$ (predicate invention is not allowed),
let $L_{rd}$ be the declarative bias such that for any RSILP-ESD system $S =
(E, B, L_{rd})$, $H \in \mathcal{H}(S) \Rightarrow$ head predicate of $C$ is not in the body of $C$, for
any $C \in H$ (directly recursive definition is not allowed), and
let $L_{eu}$ be a declarative bias such that for any RSILP-ESD system $S = (E, B, L_{eu})$,
$H \in \mathcal{H}(S) \Rightarrow e \notin C$ for any $e \in E$ and any $C \in H$.

Let $V$ be any set of ground atoms. Let $pred(V)$ denote the set of predicate
symbols used in $V$. For each $A \subseteq pred(V)$, let $V_A = \{q(\ldots) \in V \mid q \in A\}$, and
$placelist(A) = \{(q, i) \mid q \in A$, and $1 \le i \le n_q$ where $n_q$ is the arity of $q\}$. Let $B$ be
any background knowledge of an RSILP-ES system. For each $Z \subseteq placelist(A)$,
where $A \subseteq pred(B)$, let $L_Z$ be the declarative bias such that, for any RSILP-ES system
$(E, B)$ with $B$ as the background knowledge:

$\forall H \in \mathcal{H}(E, B, L_Z)$, $\forall C \in H$
$[\neg q(t_1, \ldots, t_n) \in C \Rightarrow [q \in A \wedge \forall i \in \{1, \ldots, n\} \, [(q, i) \in Z \Rightarrow t_i$ is a variable$]]]$.


3  The  gRS-ILP  model  and  motifs  in  strings 

3.1  Definition of a motif-RSILP-ESD system

Let $\Sigma$ be a finite alphabet of letters. A string over $\Sigma$ is any sequence of finite
length composed of letters from $\Sigma$. We use $\Sigma^+$ to denote the set of all the strings
over $\Sigma$. Let the term substring of a string have its usual meaning. (Note that the
characters in a substring of a string $x$ must occur contiguously in $x$.) If $r$ ($\in \Sigma^+$)
is a substring of a string $s$ ($\in \Sigma^+$), then $r$ is called a positive motif of $s$. If $r$ is
not a substring of the string $s$, then $r$ is called a negative motif of $s$.

Definition 3.13. We define a motif-RSILP-ESD system as a 2-tuple $S =
(S', \{\Sigma_1, \Sigma_2, \ldots, \Sigma_n\})$, for some finite $n \ge 1$, where:

(1) each $\Sigma_i$, $1 \le i \le n$, is a finite alphabet of letters, and

(2) $S' = (E, B, L)$ is an RSILP-ESD system such that

(i) $E$ is the universe of examples consisting of a unary predicate, say $p$,

(ii) $B$ is the background knowledge consisting of ground unit clauses, using the
following three predicates: strings (of arity, say $m$), contains and abstains (both
of arity 2), where for any $p(x) \in E$:

(a) $strings(x, s_1, s_2, \ldots, s_{m-1}) \in B \Rightarrow s_1, s_2, \ldots, s_{m-1}$ are attribute strings
of the example $p(x)$,
where for each $1 \le j \le m-1$, $s_j \in \Sigma_i^+$ for some $1 \le i \le n$,

(b) $contains(r, s) \in B \Rightarrow r$ ($\in \Sigma_i^+$) is a positive motif of attribute string
$s$ ($\in \Sigma_i^+$), $1 \le i \le n$, and

(c) $abstains(r, s) \in B \Rightarrow r$ ($\in \Sigma_i^+$) is a negative motif of attribute string
$s$ ($\in \Sigma_i^+$), $1 \le i \le n$, and

(iii) $L$ is the declarative bias $L = L_Z \wedge L_{pi} \wedge L_{rd} \wedge L_{eu}$,
where $A = \{strings, contains, abstains\} = pred(B)$ and $Z = \{(strings, 1),
(strings, 2), \ldots, (strings, m), (contains, 2), (abstains, 2)\}$. □

It is seen that the motif-RSILP-ESD system is an R-RSILP-ESD system
studied in [7].

It is to be noted that the model can be expressed in an alternate manner, with
$B$ using 2-arity predicates of the form $d_1(x, s_1), d_2(x, s_2), \ldots, d_{m-1}(x, s_{m-1})$,
rather than the single m-arity predicate $strings(x, s_1, s_2, \ldots, s_{m-1})$. $L$ is then
suitably modified with $A = \{d_1, d_2, \ldots, d_{m-1}, contains, abstains\}$ and $Z =
\{(d_1, 1), (d_1, 2), \ldots, (d_{m-1}, 1), (d_{m-1}, 2), (contains, 2), (abstains, 2)\}$.

3.2  An  example 

We  now  consider  an  illustration  of  a  motif-RSILP-ESD  system. 

The Protein Identification Resource database [9] contains amino acid sequences,
with the FEATURE field for each sequence indicating where the transmembrane
domains are located within the sequence. The amino acid sequences
are cut into substrings in such a manner that positive example attribute strings
are entirely within transmembrane domains, and negative example attribute
strings are entirely outside transmembrane domains.

The  identification  of  transmembrane  domains  from  amino  acid  sequences  is 
described  in  [10].  A  decision  tree  is  learnt  that  can  classify  any  new  attribute 
string as a transmembrane domain or not. The simple form 'xAy' of a regular pattern
language [10] is used in the nodes of a decision tree, where 'x' and 'y' are variable
substrings and 'A' is a given fixed substring (the motif). The decision tree consists
of  leaf  nodes  (labelled  with  the  resulting  class)  and  internal  nodes  (labelled  with 
a  regular  pattern  of  the  form  ‘xAy’).  At  an  internal  node,  the  decision  tree  tests 
if  the  attribute  string  matches  the  regular  pattern.  The  ‘Y’  path  of  the  node  of 
the  decision  tree  is  taken  when  the  attribute  string  is  of  the  form  ‘xAy’,  that 
is,  the  motif  ‘A’  is  contained  in  the  attribute  string;  and  the  ‘N’  path  taken 
otherwise. 

The  simple  form  ‘xAy’  of  a  regular  pattern  language  determines  whether  a 
given  motif  ‘A’  is  ‘contained’  in  the  attribute  string.  The  ‘contains’  operator  has 
been  studied  in  detail  in  [11].  The  ‘contains’  and  ‘abstains’  operators  are  used 
in  [12, 13]  to  learn  transmembrane  domains  from  amino  acid  sequences.  The 
‘contains’  operator  is  true  when  the  motif  is  contained  in  the  attribute  string 
and  false  otherwise.  The  operator  ‘abstains’  is  the  opposite  of  contains,  and  is 
true  when  the  motif  is  not  contained  in  the  attribute  string. 

The Kyte and Doolittle hydropathy index [14] of an amino acid is used to
divide the amino acids into three distinct categories. The twenty-symbol
amino acid sequences are transformed into three-symbol strings by assigning
each amino acid symbol to one of the following three distinct categories: amino
acids with positive hydropathy index (1.8 to 4.5) (*), with slightly negative
hydropathy index (-1.6 to -0.4) (+), and those with very negative hydropathy
index (-4.5 to -3.2) (-). $\Sigma_1$ is an alphabet of the 3 letters +, - and *. $\Sigma_2$ is an
alphabet of 3 letters a, b, n according to whether the amino acid is acidic, basic
or neutral.

That is, let $\Sigma_1 = \{+, -, *\}$ and $\Sigma_2 = \{a, b, n\}$.
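
The translation into the three-letter alphabet can be sketched in Python as follows. The hydropathy values are the standard published Kyte and Doolittle values for the twenty amino acids (one-letter codes); the names `KYTE_DOOLITTLE`, `hydropathy_symbol` and `to_three_symbols` are hypothetical, and the thresholds simply follow the three ranges quoted above. This is an illustration, not the preprocessing code used by the authors.

```python
# Kyte and Doolittle hydropathy index, keyed by one-letter amino acid code.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_symbol(amino_acid: str) -> str:
    """Map one amino acid to '*', '+' or '-' using the ranges given in the text."""
    index = KYTE_DOOLITTLE[amino_acid]
    if index >= 1.8:      # positive hydropathy index (1.8 to 4.5)
        return "*"
    if index >= -1.6:     # slightly negative hydropathy index (-1.6 to -0.4)
        return "+"
    return "-"            # very negative hydropathy index (-4.5 to -3.2)


def to_three_symbols(sequence: str) -> str:
    """Translate a twenty-symbol amino acid sequence into a three-symbol string."""
    return "".join(hydropathy_symbol(a) for a in sequence)


example = to_three_symbols("AGR")
```

No amino acid has an index strictly between the three ranges, so the two thresholds are enough to separate the categories.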

Let $E = \{p(e1), p(e2), p(e3)\}$.

Let $B$ = {strings(e1, +++-++, aaaaaa), strings(e2, ++-, aab), strings(e3, ++-, aba),
contains(++, +++-++), contains(+-, +++-++), contains(-+, +++-++), ...,
abstains(+*, +++-++), ..., contains(a, aaaaaa), contains(a, aab), ...,
abstains(b, aaaaaa), ...}.

Let $L = L_Z \wedge L_{pi} \wedge L_{rd} \wedge L_{eu}$, where $A = pred(B)$ and $Z$ = {(strings, 1),
(strings, 2), (strings, 3), (contains, 2), (abstains, 2)}.

$S = (S', \{\Sigma_1, \Sigma_2\})$, where $S' = (E, B, L)$, is a motif-RSILP-ESD system.
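
The 'contains' and 'abstains' facts of this example can be checked with a few lines of Python. The sketch assumes that a motif is simply a contiguous substring, as defined in Section 3.1; the function names are chosen for this illustration only.

```python
from typing import List


def contains(motif: str, s: str) -> bool:
    """True when motif is a positive motif (contiguous substring) of s."""
    return motif in s


def abstains(motif: str, s: str) -> bool:
    """True when motif is a negative motif of s, i.e. not a substring of s."""
    return not contains(motif, s)


def motifs_of_length(s: str, k: int) -> List[str]:
    """All distinct positive motifs of s of length k, in order of first occurrence."""
    seen: List[str] = []
    for i in range(len(s) - k + 1):
        motif = s[i:i + k]
        if motif not in seen:
            seen.append(motif)
    return seen


# Attribute strings of example e1 from the background knowledge B above.
hydropathy_string = "+++-++"
positive_motifs = motifs_of_length(hydropathy_string, 2)
```

For the first attribute string of e1 the positive motifs of length 2 are exactly those listed in the contains facts of B.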


3.3  Experimental  illustration 

583 positive and 583 negative examples of the transmembrane data from PIR
[9] are used in this experimental illustration. The amino acid sequences are converted
into three-symbol strings based on the Kyte and Doolittle hydropathy
index of the amino acid as described above. This is the same translation mechanism
used initially in [10]. The symbols 0, 1 and 2 are used instead of the
symbols *, + and -, respectively, that are used in [10]. A motif length of 2 is used.

Progol is an Inductive Logic Programming system written in C by Dr. Muggleton
[15]. The syntax for examples, background knowledge and hypothesis is
Dec-10 Prolog. Headless Horn clauses are used to represent negative examples
and constraints. Progol source code and example files are freely available (for
academic research) from ftp.cs.york.ac.uk under the appropriate directory
in pub/ML_GROUP. Progol version 4.4 dated 25.08.98 is used in this study.

Since only one type of string is used in this experimental illustration, a simplified
form of the motif-RSILP-ESD system is used. The background knowledge
$B$ consists of predicates such as c(p1,s22) or a(p1,s22), when the motif
'22' (equivalent to '--') is present or not present, respectively, in the attribute
string of example tm(p1). The positive examples ($E^+$) are given as 'tm(p1).'
to 'tm(p583).' and the negative examples ($E^-$) are given as ':- tm(n1).' to
':- tm(n583).' Appropriate mode declarations are used in Progol to incorporate
the required declarative bias $L$. Let $S = (E, B, L)$.
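
The background facts described above could be generated along the following lines. This is only a sketch: the paper does not show how its input files were produced, so the function `motif_facts`, the example string and the choice to emit one fact per possible motif of length 2 are assumptions made for illustration.

```python
from itertools import product
from typing import List


def motif_facts(
    example_id: str, s: str, alphabet: str = "012", length: int = 2
) -> List[str]:
    """Emit one c/2 (contains) or a/2 (abstains) fact per possible motif."""
    facts = []
    for letters in product(alphabet, repeat=length):
        motif = "".join(letters)
        predicate = "c" if motif in s else "a"
        facts.append(f"{predicate}({example_id},s{motif}).")
    return facts


# Hypothetical attribute string for example p1 over the symbols 0, 1 and 2.
facts = motif_facts("p1", "0012")
```

With three symbols and motif length 2 there are nine possible motifs, so each example contributes nine facts to the background knowledge.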

The first step is any conventional Progol experiment using the data set.
Conventionally, the aim is to maximise the correct cover of both positive and negative
examples (in other words, try to increase the number of positive examples covered
and decrease the number of negative examples covered). Let this induced
program be known as $P$ for the purpose of this outline.

The second step uses Progol with the default noise setting of zero, where
any induced hypothesis is consistent and cannot cover any negative example.
Let this induced consistent program be $P_{cons}$.

The  induced  hypothesis  of  Peons  follows. 

tm(A) :- a(A,s11), a(A,s12), a(A,s22).
tm(A) :- a(A,s20), a(A,s21), c(A,s12).
tm(A) :- a(A,s12), a(A,s22), c(A,s21).
tm(A) :- a(A,s12), a(A,s20), c(A,s22).
tm(A) :- a(A,s02), a(A,s11).
tm(A) :- a(A,s11), a(A,s21), a(A,s22).
tm(A) :- a(A,s11), a(A,s20).
tm(A) :- a(A,s12), a(A,s20), c(A,s02).
tm(A) :- a(A,s02), a(A,s12), c(A,s20).
tm(A) :- a(A,s10), a(A,s21).

The third step is to determine a reverse-consistent program, denoted by
$P_{rev-cons}$, by exchanging the roles of $E^+$ and $E^-$ and then repeating
step 2. The induced hypothesis of $P_{rev-cons}$ follows.

tm(A) :- a(A,s00).
tm(A) :- a(A,s10), c(A,s21).
tm(A) :- a(A,s01), c(A,s22).

The results are tabulated below.

|E+|    |E-|    |E|     |cover(S, P_cons)|    |cover(S, P_rev-cons)|
583     583     1166    249                   55

Using Facts 6 and 8 we have the following.

If $P_{cons} \vdash e$, then $e \in E^+$.

If $P_{rev-cons} \vdash e$, then $e \in E^-$.

Otherwise P is used:

If $P \vdash e$, then it is very likely that $e \in E^+$;
else if $P \nvdash e$, then it is very likely that $e \in E^-$.

249 out of 583 positive examples are definitively described by $P_{cons}$ and 55
out of 583 negative examples are definitively described by $P_{rev-cons}$.
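
The three-way decision procedure can be sketched as follows. This is a minimal illustration rather than part of the Progol experiment: the function describe and the toy cover sets are hypothetical, each covers_* callable stands in for "program derives example", and checking $P_{cons}$ before $P_{rev-cons}$ is an assumption about ordering.

```python
def describe(example, covers_cons, covers_rev_cons, covers_p):
    """Return (label, definitive) for one example."""
    if covers_cons(example):
        return "positive", True       # P_cons covers no negative example
    if covers_rev_cons(example):
        return "negative", True       # P_rev-cons covers no positive example
    # Fall back on P: the label is only very likely, not certain
    label = "positive" if covers_p(example) else "negative"
    return label, False


# Toy cover sets (made up for illustration)
cons_cover = {"p1", "p2"}
rev_cover = {"n1"}
p_cover = {"p1", "p2", "p3"}

results = [
    describe(e, cons_cover.__contains__, rev_cover.__contains__,
             p_cover.__contains__)
    for e in ("p1", "n1", "p3", "n2")
]
```

Only the first two branches give a definitive description; the fallback on P returns a label flagged as non-definitive.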

Earlier systems conventionally do not use $P_{cons}$ and $P_{rev-cons}$. They
handle the rough setting by inducing P to maximize correct cover, that is, by
maximizing the number of positive examples covered and negative examples not covered.
However, this does not definitively describe the data, since P cannot say with
certainty whether an example definitely belongs to the evidence or not. When
the gRS-ILP model is used, $P_{cons}$ and $P_{rev-cons}$ are induced to definitively
describe part of the data. The rest of the data can be described by P, but not
definitively.

4  Conclusions 

The gRS-ILP model is extended to motifs in strings. An illustrative experiment
is presented regarding the definitive description of transmembrane domains from
amino acid sequences using Progol. Possibilities for further work include
extensions of the gRS-ILP model to areas other than definitive description, such as
prediction, and to other application areas.
Abstract. Inductive Logic Programming (ILP) is a relatively new method
in machine learning. Compared with the traditional attribute-value learning
methods, it has some advantages (stronger expressive power and
the ease of using background knowledge), but also some weak points. One
particular issue is that the theory, techniques and experience for dealing
with imperfect data are much less mature in ILP than in the traditional
attribute-value learning methods. This paper applies Rough Set theory
to ILP to deal with imperfect data which occur in large real-world
applications. We investigate various kinds of imperfect data in ILP and
identify a subset of them to tackle. Namely, we concentrate on incomplete
background knowledge (where essential predicates/clauses are missing)
and on indiscernible data (where some examples belong to both sets of
positive and negative training examples), proposing rough problem settings
for these cases. The rough settings relax the strict requirements in
the standard normal problem setting for ILP, so that rough but useful
hypotheses can be induced from imperfect data.


1  Introduction 

Inductive Logic Programming (ILP, see [2, 6, 7, 8]) is a relatively new method
in machine learning. ILP is concerned with learning from examples within the
framework of predicate logic. ILP is relevant to Knowledge Discovery and Data
Mining (KDD), and compared with the traditional attribute-value learning
methods (the mainstream in the KDD community to date), it possesses the following
advantages:

- ILP can learn knowledge which is more expressive than that learned by the
attribute-value learning methods, because the former is in predicate logic while the
latter is usually in propositional logic.

- ILP can utilize background knowledge more naturally and effectively,
because in ILP the examples, the background knowledge, as well as the learned
knowledge are all expressed within the same logic framework.

However, when applying ILP to KDD, we can identify some weak points
compared with the traditional attribute-value learning methods, such as:

- It is more difficult to handle numbers (especially continuous values)
prevailing in real-world databases, because predicate logic lacks effective means for
this.

- The theory, techniques and experience for dealing with imperfect data
(uncertainty, incompleteness, vagueness, impreciseness, etc. in examples,
background knowledge as well as the learned rules) are much less mature in ILP than in
the traditional attribute-value learning methods (see [3, 13, 15], for instance).

In [4], we suggested Constraint Inductive Logic Programming (CILP), an
integration of ILP and CLP (Constraint Logic Programming), as a solution for
the first problem mentioned above.

This paper addresses the second problem, applying Rough Set theory to
ILP to deal with some kinds of imperfect data which occur in large real-world
applications. Namely, we concentrate on incomplete background knowledge (where
essential predicates/clauses are missing) and on indiscernible data (where some
examples belong to both sets of positive and negative training examples),
proposing rough problem settings for these cases. The rough settings relax the strict
requirements in the standard normal problem setting for ILP, so that rough but
useful hypotheses can be induced from imperfect data.

This paper is organized as follows. Section 2 gives the standard
problem setting for ILP, which assumes that everything is correct and perfect.
Section 3 lists various kinds of imperfect data in ILP and identifies a subset of
them to tackle in this paper. Section 4 is a brief review of the part of
Rough Set theory that is relevant to our purpose. Section 5
proposes rough problem settings for incomplete background knowledge and for
indiscernible data, and discusses related work. Finally, Section 6
summarizes our work and points out future research directions.


2  The  Normal  Problem  Setting  for  ILP 

We follow the notation of [8]. In particular, supposing C is a set of clauses
$\{c_1, c_2, \ldots\}$, we use $\overline{C}$ to denote the set
$\{\neg c_1, \neg c_2, \ldots\}$. The normal problem setting for ILP
can be stated as follows:

Given the positive examples $E^+$ and the negative examples $E^-$ (both are
sets of clauses) and the background knowledge B (a finite set of clauses),
ILP is to find a theory H (a finite set of clauses) which is correct with
respect to $E^+$ and $E^-$. That demands:

- $\forall e \in E^+:\ H \cup B \models e$ (completeness wrt. $E^+$);

- $H \cup B \cup \overline{E^-}$ is satisfiable (consistency wrt. $E^-$).

The  above  ILP  problem  setting  is  somewhat  too  general.  In  most  of  the  ILP 
literature,  the  following  simplifications  are  assumed: 

-  Single  predicate  learning.  The  concept  to  be  learned  is  represented  by  a 
single  predicate  p  (called  the  Target  predicate).  Examples  are  instances  of  the 
target  predicate  p  and  the  induced  theory  is  the  defining  clauses  of  p.  Only 
the  background  knowledge  B  may  contain  definitions  of  other  predicates 
which  can  be  used  in  the  defining  clauses  of  the  target  predicate. 
- Restricted within definite clauses. All clauses contained in B and H are
definite clauses, and the examples are ground atoms of the target predicate.
We can prove that in this case the condition of consistency has an equivalent
form: supposing that $\Sigma$ is a set of definite clauses and $E^-$ is a set of ground
atoms, then $\Sigma$ is consistent with respect to $E^-$ if and only if

$\forall e \in E^-:\ \Sigma \not\models e$.

This form is more operational than the general condition (i.e.,
$\Sigma \cup \overline{E^-}$ is satisfiable).

This paper also adopts these simplifications. For convenience of later
reference, we restate the (simplified) normal problem setting for ILP in a more
formal way:

Given: 

—  The  target  predicate  p. 

- The positive examples $E^+$ and the negative examples $E^-$ (two sets of ground
atoms of p).

- Background knowledge B (a finite set of definite clauses).

To find:

- Hypothesis H (the defining clauses of p) which is correct with respect to $E^+$
and $E^-$, that is,

1. $H \cup B$ is complete with respect to $E^+$ (i.e. $\forall e \in E^+:\ H \cup B \models e$).
We also say that $H \cup B$ covers all positive examples.

2. $H \cup B$ is consistent with respect to $E^-$ (i.e. $\forall e \in E^-:\ H \cup B \not\models e$).
We also say that $H \cup B$ rejects any negative examples.
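
The two correctness conditions can be sketched as a simple check. This is an illustration under stated assumptions: the function is_correct, the callable covers (standing in for "$H \cup B \models e$") and the toy atoms are all hypothetical.

```python
def is_correct(covers, positives, negatives):
    """Return (complete, consistent) for a hypothesis given as a cover test."""
    complete = all(covers(e) for e in positives)        # every e in E+ is covered
    consistent = not any(covers(e) for e in negatives)  # no e in E- is covered
    return complete, consistent


# Toy ground atoms p(2), p(4), p(3) written as tuples; the "hypothesis"
# covers exactly the atoms with an even argument.
positives = {("p", 2), ("p", 4)}
negatives = {("p", 3)}


def covers(atom):
    return atom[1] % 2 == 0


status = is_correct(covers, positives, negatives)
status_with_extra_negative = is_correct(covers, positives, negatives | {("p", 6)})
```

A hypothesis is a solution in the normal setting only when both components are true; the rough settings of Section 5 change the example sets against which these two checks are made.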

To make the ILP problem meaningful, we assume the following prior
conditions:

- B is not complete with respect to $E^+$. (Otherwise there will be no learning
task at all, because the background knowledge itself is the solution).

- $B \cup E^+$ is consistent with respect to $E^-$. (Otherwise there will be no solution
to the learning task).

In the above normal problem setting for ILP, everything is assumed correct
and perfect. But in large, real-world empirical learning, data are not always
perfect. On the contrary, uncertainty, incompleteness, vagueness, impreciseness, etc.
are frequently observed in training examples, in background knowledge, as well
as in the induced hypothesis. Thus ILP has to deal with imperfect data. In
this respect, the theory, measurement, techniques and experience are much less
mature for ILP than for the traditional attribute-value learning methods. This
paper addresses this problem, focusing on the potential role of Rough Set
theory in it.
3  Imperfect  Data  in  ILP

We  distinguish  five  kinds  of  imperfect  data  encountered  in  ILP: 

1.  Imperfect  output 

In ILP, even if the input data (examples and background knowledge) are
“perfect”, there is usually more than one hypothesis which can be induced, and
the preferential order among hypotheses is an important issue. If the input
data is imperfect (see below), the situation is more serious: we have imperfect
hypotheses. At present, quantitative measures associated with hypotheses in
ILP are not as rich as those of attribute-value learning [15].

2.  Noise  data 

This includes erroneous argument values in examples, and/or erroneous
classification of examples as belonging to $E^+$ or $E^-$. The ILP community has
made some advances in noise-handling, using heuristics to avoid overly
specific hypotheses which will have low prediction accuracy (see [2] for details).
The ideas come from similar techniques developed within the
attribute-value learning framework.

3.  Too  sparse  data 

This means that the training examples are too sparse to induce a reliable
hypothesis H. The noise-handling mechanisms mentioned above usually also
take care of too sparse data. Zhong [16, 17] proposes a mechanism considering
unseen instances, which can also be extended to ILP.

4.  Missing  data 

(a)  Missing  values 

This  means  that  some  arguments  of  some  examples  have  unknown  values. 
A  simple  way  to  deal  with  this  problem  is  to  induce  a  missing  value  from 
other  examples  (e.g.  the  value  occurring  in  the  same  argument  place  of 
the  majority  of  other  examples). 

(b)  Missing  predicates 

This means that the background knowledge B lacks essential predicates
(or essential clauses of some predicates) so that no non-trivial hypothesis
H can be induced. (Note that $E^+$ itself can always be regarded as a
hypothesis, but it is trivial). In particular, even though a large number
of positive examples are given, some examples are not generalized by
hypotheses if some background knowledge is missing. This is a big topic
in the research area of ILP. A recent study by Muggleton has taken some
important steps in this area [5].

5.  Indiscernible  data 

This means that some examples belong to both $E^+$ and $E^-$. In this case, the
second prior condition ($B \cup E^+$ is consistent with respect to $E^-$) in the normal
setting is not satisfied, so there will be no solution to the learning task.

As the above list clearly shows, imperfect data handling is too vast a task
to tackle in one paper. In the remainder of this paper, we will concentrate on
item 4(b) (Missing predicates) and item 5 (Indiscernible data). In both cases,
the requirement in the normal problem setting of ILP that H should be “correct
with respect to $E^+$ and $E^-$” needs to be relaxed, otherwise there will be no
(meaningful) solutions to the ILP problem. We will give rough problem settings
in the cases of missing predicates and indiscernible data, using concepts from
Rough Set theory [9, 10].

4  Rough  Set  Theory 

Rough Set theory is a powerful mathematical model of imprecise
information. For the reader’s convenience, here we review some concepts in the theory
[9, 10, 14] which are relevant to our rough problem settings of ILP presented in
the next section.

Approximation space $A = (U, R)$. Here U is a set (called the universe) and
R is an equivalence relation on U (called an indiscernibility relation). In fact, U
is partitioned by R into equivalence classes; elements within an equivalence class
are indistinguishable in the approximation space A.

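Lower and upper approximations. For an equivalence relation R, the lower
and upper approximations of $X \subseteq U$ are defined by

$\underline{Apr}_A(X) = \bigcup_{[x]_R \subseteq X} [x]_R = \{x \in U \mid [x]_R \subseteq X\}$  (1)

$\overline{Apr}_A(X) = \bigcup_{[x]_R \cap X \neq \emptyset} [x]_R = \{x \in U \mid [x]_R \cap X \neq \emptyset\}$  (2)

where $[x]_R$ denotes the equivalence class containing x. Furthermore, the
subscript A can be dropped, writing $\underline{Apr}(X)$ and $\overline{Apr}(X)$,
when A is implicit.

Boundary. $Bnd_A(X) = \overline{Apr}_A(X) - \underline{Apr}_A(X)$ is called the
boundary of X in A.

Rough membership.

- element x surely belongs to X in A if $x \in \underline{Apr}_A(X)$;

- element x possibly belongs to X in A if $x \in \overline{Apr}_A(X)$;

- element x surely does not belong to X in A if $x \notin \overline{Apr}_A(X)$.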
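
As a small worked illustration of definitions (1) and (2), the sketch below computes both approximations and the boundary for a toy universe. The universe, the set X and the function names are made up for illustration; the equivalence relation is represented by a hypothetical key function that maps each element to a label of its class.

```python
def equivalence_classes(universe, key):
    """Partition the universe: elements with the same key are indiscernible."""
    classes = {}
    for x in universe:
        classes.setdefault(key(x), set()).add(x)
    return list(classes.values())


def lower_approx(universe, key, X):
    """Union of the equivalence classes entirely contained in X."""
    return {x for c in equivalence_classes(universe, key) if c <= X for x in c}


def upper_approx(universe, key, X):
    """Union of the equivalence classes that intersect X."""
    return {x for c in equivalence_classes(universe, key) if c & X for x in c}


U = set(range(1, 9))


def key(x):
    return (x - 1) // 2           # classes {1,2}, {3,4}, {5,6}, {7,8}


X = {1, 2, 3, 7}
lower = lower_approx(U, key, X)   # elements that surely belong to X
upper = upper_approx(U, key, X)   # elements that possibly belong to X
boundary = upper - lower
```

Here only the class {1, 2} lies entirely inside X, while the classes {3, 4} and {7, 8} merely intersect it and so form the boundary.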

5  Rough  Problem  Settings  for  ILP 

5.1  Rough  Problem  Setting  for  Missing  Predicates/Clauses

Consider the following ILP problem (adapted from [1]):

Given: 

- The target predicate “customer(Name, Age, Sex, Income)”

- The positive examples $E^+$:

customer(a, 30, female, 1).
customer(b, 53, female, 100).
customer(d, 50, female, 2).
customer(e, 32, male, 10).
customer(f, 55, male, 10).

- The negative examples $E^-$:

customer(c, 50, female, 2).
customer(g, 20, male, 2).

- Background knowledge B defining predicate “married_to(Husband, Wife)” by

married_to(e, a).
married_to(f, d).

To find:

- Hypothesis H (the definition of customer/4) which is correct with respect
to $E^+$ and $E^-$.

The normal problem setting (see Section 2) is perfectly suitable for this
problem, and an ILP system can induce the following hypothesis H (a Prolog program
defining customer/4):

customer(N, A, S, I) :- I >= 10.

customer(N, A, S, I) :- married_to(N', N), customer(N', A', S', I').

However, if predicate married_to/2 (or its second clause “married_to(f, d)”) is
missing in the background knowledge B (and no other predicates/clauses in
B tell any essential difference between persons c and d), no meaningful
hypothesis will be induced, because no Prolog program defining customer/4 can
explain why person d is a customer while person c is not, given the fact that,
except for their Names, all descriptions of the two persons are the same.

This illustrates that even if a learning task can be expressed in the normal
problem setting for ILP, it is possible that no meaningful hypothesis can be
induced due to the lack of essential predicates/clauses in the background
knowledge. In order to learn something useful in these cases, the requirement in the
normal problem setting of ILP that H should be “correct with respect to $E^+$
and $E^-$” has to be relaxed. We propose the following rough problem setting for
incomplete background knowledge.
Rough Problem Setting 1 (for missing predicates/clauses)

Given:

- The target predicate p (the set of all ground atoms of p is U)

- An equivalence relation R on U (we have the approximation space $A = (U, R)$)

- $E^+ \subseteq U$ and $E^- \subseteq U$ satisfying the prior condition: $B \cup E^+$ is
consistent with respect to $E^-$

- Background knowledge B (may lack essential predicates/clauses)

Considering the following rough sets:

- $E^{++} = \overline{Apr}_A(E^+)$, containing all positive examples, and those negative
examples $E^{-+} = \{e' \in E^- \mid \exists e \in E^+:\ e\,R\,e'\}$

- $E^{--} = E^- - E^{-+}$, containing the “pure” (remaining) negative examples

- $E_{++} = \underline{Apr}_A(E^+)$, containing the “pure” positive examples. That is,
$E_{++} = E^+ - E^{+-}$, where $E^{+-} = \{e \in E^+ \mid \exists e' \in E^-:\ e\,R\,e'\}$

- $E_{--} = E^- \cup E^{+-}$, containing all negative examples and “non-pure” positive
examples

To find:

- Hypothesis $H^+$ (the defining clauses of p) which is correct with respect to
$E^{++}$ and $E^{--}$, that is,

  • $H^+ \cup B$ covers all examples of $E^{++}$.

  • $H^+ \cup B$ rejects any examples of $E^{--}$.

- Hypothesis $H^-$ (the defining clauses of p) which is correct with respect to
$E_{++}$ and $E_{--}$, that is,

  • $H^- \cup B$ covers all examples of $E_{++}$.

  • $H^- \cup B$ rejects any examples of $E_{--}$.

We now return to our illustrative example, where predicate married_to/2 is
missing in the background knowledge B. Let R be defined as

“customer(N, A, S, I) R customer(N’, A, S, I)”.

With the rough problem setting 1, we may induce $H^+$ as:

customer(N, A, S, I) :- I >= 10.
customer(N, A, S, I) :- S = female.

which covers all positive examples and the negative example “customer(c, 50,
female, 2)”, rejecting other negative examples. We may also induce $H^-$ as:

customer(N, A, S, I) :- I >= 10.
customer(N, A, S, I) :- S = female, A < 50.

which covers all positive examples except “customer(d, 50, female, 2)”, rejecting
all negative examples.
These hypotheses are rough (because the problem itself is rough), but still
useful. On the other hand, if we insist on the normal problem setting for ILP,
these hypotheses are not considered as “solutions”.
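
The rough sets of this example can be checked with a short sketch. It is an illustration only: examples are written as (Name, Age, Sex, Income) tuples, R relates two atoms that agree on everything except the name, the two hypotheses are hand-translated into Python predicates, and the income test is read as "at least 10", since persons e and f have income exactly 10 and must be covered.

```python
E_pos = {("a", 30, "female", 1), ("b", 53, "female", 100),
         ("d", 50, "female", 2), ("e", 32, "male", 10), ("f", 55, "male", 10)}
E_neg = {("c", 50, "female", 2), ("g", 20, "male", 2)}


def related(x, y):
    """R: same Age, Sex and Income; the Name is ignored."""
    return x[1:] == y[1:]


neg_like_pos = {n for n in E_neg if any(related(p, n) for p in E_pos)}
pos_like_neg = {p for p in E_pos if any(related(p, n) for n in E_neg)}

upper_pos = E_pos | neg_like_pos   # positives plus indiscernible negatives
pure_neg = E_neg - neg_like_pos    # "pure" negative examples
lower_pos = E_pos - pos_like_neg   # "pure" positive examples
rough_neg = E_neg | pos_like_neg   # negatives plus "non-pure" positives


def h_plus(x):
    return x[3] >= 10 or x[2] == "female"


def h_minus(x):
    return x[3] >= 10 or (x[2] == "female" and x[1] < 50)


h_plus_ok = (all(h_plus(x) for x in upper_pos)
             and not any(h_plus(x) for x in pure_neg))
h_minus_ok = (all(h_minus(x) for x in lower_pos)
              and not any(h_minus(x) for x in rough_neg))
```

Persons c and d are the only indiscernible pair, so they are exactly the examples on which the two hypotheses disagree.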


5.2  Rough  Problem  Setting  for  Indiscernible  Examples 

In the KDD community, people have to deal with situations such as: two
patients showing the same symptoms have received different diagnostic results; two
person records in a database having the same values for condition attributes
have different decision attribute values; etc. In ILP, the similar situation is that
an example e belongs to $E^+$ as well as to $E^-$.

In the illustration given in Section 5.1, if we ignore the person names, the
target predicate will be “customer(Age, Sex, Income)” and we will encounter
indiscernible examples: “customer(50, female, 2)” belongs to $E^+$ as well as to
$E^-$. Then the problem cannot be expressed in the normal problem setting at
all, because the second prior condition ($B \cup E^+$ is consistent with respect to $E^-$) is
violated. In order to learn something useful in these cases, the requirement in
the normal problem setting of ILP that H should be “correct with respect to
$E^+$ and $E^-$” also has to be relaxed. We propose the following rough problem
setting for indiscernible examples, which essentially is a special case of the above
rough setting 1.

Rough  Problem  Setting  2  (for  indiscernible  examples) 

Given: 

-  The  target  predicate  p  (the  set  of  all  ground  atoms  of  p  is  U) 

- $E^+ \subseteq U$ and $E^- \subseteq U$, where $E^+ \cap E^- \neq \emptyset$

-  Background  knowledge  B 

The  rough  sets  to  consider  and  the  hypotheses  to  find: 

- Taking the identity relation I as a special equivalence relation R, the
remaining description of rough setting 2 is the same as in rough setting 1.

5.3  Related  Work 

Siromoney [12] also tries to apply Rough Set theory to ILP. It considers
$A = (U, R)$, where U is the set of all possible positive and negative examples and the
equivalence relation R on U is defined as $e\,R\,e'$ iff, for any H which can be induced,
either $H \models e$ and $H \models e'$, or $H \not\models e$ and $H \not\models e'$.
Then it considers the concept to be learned as a subset C of U, and points out that

$E^+ \subseteq \overline{Apr}_A(C)$ and $E^- \subseteq U - \underline{Apr}_A(C)$.

However, it did not distinguish different kinds of imperfect data, nor give any
problem settings for ILP. We think this is not surprising, because R and C used
in [12] are posterior in the sense that the user does not know them a priori. In
contrast, we use a user-defined R, together with $E^+$, $E^-$, etc. (all known to the user), as the
starting point, so we can give rough problem settings for ILP, and it is possible to
develop ILP systems allowing the user to specify rough problem settings, as well
as the normal problem setting. These ILP systems will be able to induce useful
hypotheses even when the input data are imperfect.

6  Conclusions  and  Future  Work 

This paper addresses the problem of imperfect data handling in Inductive Logic
Programming (ILP). We discuss various kinds of imperfect data in ILP, and
apply Rough Set theory to incomplete background knowledge (where essential
predicates/clauses are missing) and to indiscernible data (where some examples
belong to both sets of positive and negative training examples), proposing rough
problem settings for these cases. The rough settings relax the strict
requirements in the standard normal problem setting for ILP, so that rough but useful
hypotheses can be induced from imperfect data.

Future  work  in  this  direction  includes: 

—  Trying  to  apply  the  Rough  Set  theory  to  other  kinds  of  imperfect  data  (noise 
data,  too  sparse  data,  missing  data,  etc.)  in  ILP. 

—  Giving  quantitative  measures  associated  with  hypotheses  induced  within  the 
rough  problem  settings  of  ILP,  using  appropriate  concepts  and  techniques 
from  the  Rough  Set  theory. 

- Developing a new ILP system (or extending an existing ILP system) which
allows the user to specify rough problem settings, as well as the normal
problem setting. The new ILP system will be able to induce useful hypotheses
even when the input data are imperfect.
Abstract. Practical machine learning algorithms are known to degrade
in performance when faced with many features that are not necessary
for rule discovery. To cope with this problem, many methods for selecting
a subset of features with similar-enough behaviors to merit focused
analysis have been proposed. Among such methods, two typical ones are the
filter approach, which selects a feature subset using a preprocessing step,
and the wrapper approach, which selects an optimal feature subset from the
space of possible subsets of features using the induction algorithm itself
as a part of the evaluation function. Although the filter approach
is faster, it is somewhat blind in that the performance of induction
is not considered. On the other hand, optimal feature subsets can
be obtained by using the wrapper approach, but it is not easy to use
because of its time and space complexity. In this paper, we propose
an algorithm that uses the rough set methodology with greedy heuristics
for feature selection. In our approach, selecting features is similar to the
filter approach, but the performance of induction is considered in the
evaluation criterion for feature selection. That is, we select the features
that damage the performance of induction as little as possible.


1  Introduction 

Generally speaking, the purpose of building databases in most organizations is
that of managing information sources effectively. In other words, data are rarely
collected or stored in a database specifically for the purpose of mining knowledge.
Hence, a database usually contains many attributes that
are redundant and not necessary for rule discovery. If these redundant attributes
are not removed, not only will the time complexity of rule discovery increase,
but the quality of the discovered rules may also be much degraded.

Which attributes should be deleted is very difficult to decide for non-experts
and even for experts. Clearly, we need additional methods for selecting the
feature subset. The problem of feature subset selection is that of finding an optimal
subset of features of a database according to some criterion, so that a classifier
with the highest possible accuracy can be generated by an inductive learning
algorithm that is run on data containing only the subset of features.

Many researchers have investigated this field and several methods have been
proposed [1, 3, 11, 6, 5]. One kind of method ranks features
first according to evaluation measures such as consistency, information, distance,
and dependence, and then selects the features with a higher rank. This kind
of method considers only the data but not the classifying properties. The filter
approach belongs to this type. Another kind of method, such as the wrapper approach,
uses the induction algorithm itself as an evaluation function for
selecting the optimal subsets from the space of all possible subsets. Furthermore,
rough set theory provides a mathematical tool to find all possible
feature subsets (called reducts) [1]. Unfortunately, the number of possible subsets
is always very large when N is large because there are $2^N$ subsets for N
features. Therefore examining exhaustively all subsets of features for selecting
the optimal one is NP-hard. Most practical algorithms attempt to fit the
data by solving this NP-hard optimization problem [4].

In this paper, we propose an algorithm that uses rough set theory with
greedy heuristics for feature selection. We attempt to find an approach that is
not heavy but effective. In our approach, features are selected from the space of
features rather than the space of reducts, using an evaluation criterion in which
the performance of induction is considered. That is, we select the features that
damage the performance of induction as little as possible.


2  Dispensable  and  Indispensable  Features 

In rough set theory, a decision table is denoted $T = (U, A, C, D)$, where
U is the universe of discourse, A is a family of equivalence relations over U, and
$C, D \subset A$ are two subsets of features that are called condition and decision
features, respectively [1].

Before describing dispensable and indispensable features, some
basic terms and notation of rough set theory must be explained.

Lower  Approximations: 

The lower approximation of a set, $\underline{R}X$, is the set of all elements of U which
can be classified with certainty as elements of X, in the knowledge R, where
$X \subseteq U$. It can be presented formally:

$\underline{R}X = \bigcup \{Y \in U/R : Y \subseteq X\}$


The  Positive  Region: 

We also use a positive region to denote the lower approximations of a set.
Let P and Q be equivalence relations over U, with $P \subseteq U$ and $Q \subseteq U$. The
P-positive region of Q, $POS_P(Q)$, is the set of all objects of universe U which can
be properly classified to classes of U/Q employing knowledge expressed by the
classification U/P.

$POS_P(Q) = \bigcup_{X \in U/Q} \underline{P}X$


Dispensable  and  Indispensable  Features: 

The  dispensable  and  indispensable  features  are  defined  as  follows: 
Let $c \in C$. A feature c is dispensable in T if $POS_{(C-\{c\})}(D) = POS_C(D)$;
otherwise feature c is indispensable in T.

If c is an indispensable feature, deleting it from T will cause T to be
inconsistent. Otherwise, c can be deleted from T. $T = (U, A, C, D)$ is independent if
all $c \in C$ are indispensable in T.
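
The dispensability test can be sketched directly from the positive region. The sketch below is only an illustration: it uses the sample database of Table 1 (Section 3), read here with a, b, c, d as condition features and E as the decision feature, which is an assumption suggested by the entries of the discernibility matrix given there. The function name positive_region is hypothetical.

```python
TABLE = {
    "u1": {"a": 1, "b": 0, "c": 2, "d": 1, "E": 1},
    "u2": {"a": 1, "b": 0, "c": 2, "d": 0, "E": 1},
    "u3": {"a": 1, "b": 2, "c": 0, "d": 0, "E": 2},
    "u4": {"a": 1, "b": 2, "c": 2, "d": 1, "E": 0},
    "u5": {"a": 2, "b": 1, "c": 0, "d": 0, "E": 2},
    "u6": {"a": 2, "b": 1, "c": 1, "d": 0, "E": 2},
    "u7": {"a": 2, "b": 1, "c": 2, "d": 1, "E": 1},
}
CONDITIONS = ["a", "b", "c", "d"]
DECISION = "E"


def positive_region(features):
    """Objects whose indiscernibility class (w.r.t. features) has one decision."""
    blocks = {}
    for name, row in TABLE.items():
        blocks.setdefault(tuple(row[f] for f in features), []).append(name)
    region = set()
    for members in blocks.values():
        if len({TABLE[m][DECISION] for m in members}) == 1:
            region.update(members)
    return region


full = positive_region(CONDITIONS)
dispensable = [
    c for c in CONDITIONS
    if positive_region([f for f in CONDITIONS if f != c]) == full
]
indispensable = [c for c in CONDITIONS if c not in dispensable]
```

Removing b merges u1 and u4, which have different decisions, so the positive region shrinks; removing any one of the other features leaves it unchanged.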

Reduct:

The set of features $R \subseteq C$ will be called a reduct of C, if $T' = (U, A, R, D)$ is
independent and $POS_R(D) = POS_C(D)$.

CORE:

The set of all indispensable features in (C, D) will be denoted by CORE(C, D).

$CORE(C, D) = \bigcap RED(C, D)$

where RED(C, D) is the set of all reducts of (C, D).


3  Searching  Indispensable  Features 

All indispensable features should be contained in an optimal feature subset,
because removing any of them will cause inconsistency in a decision table. As
defined in Section 2, CORE is the set of all indispensable features. Hence the
process of searching for indispensable features is that of finding CORE.

The  discernibility  matrix  proposed  by  Skowron  [2,  1]  can  be  used  for  CORE 
searching.  The  basic  idea  of  the  discernibility  matrix  can  be  briefly  presented  as 
follows: 

Let $T = (U, A, C, D)$ be a decision table, with $U = \{u_1, u_2, \ldots, u_n\}$.
By a discernibility matrix of T, denoted M(T), we will mean the n × n matrix
defined thus:

$$m_{ij} = \{ a \in C : a(u_i) \neq a(u_j) \wedge D(u_i) \neq D(u_j) \} \quad \text{for } i, j = 1, 2, \ldots, n.$$

Thus entry $m_{ij}$ is the set of all attributes that discern objects $u_i$ and $u_j$.

The CORE can now be defined as the set of all single element entries of the
discernibility matrix, that is,

$$CORE(C) = \{ a \in C : m_{ij} = \{a\} \text{ for some } i, j \}.$$

Searching for CORE means searching for the set of features each of which is
the only feature that discerns some pair of objects.

Example 

The discernibility matrix corresponding to the sample database (the decision
table) shown in Table 1 with $U = \{u1, u2, \ldots, u7\}$, $C = \{a, b, c, d\}$,
$D = \{E\}$ is as follows:

Table 1. A Sample Database

U     a   b   c   d   E
u1    1   0   2   1   1
u2    1   0   2   0   1
u3    1   2   0   0   2
u4    1   2   2   1   0
u5    2   1   0   0   2
u6    2   1   1   0   2
u7    2   1   2   1   1

      u1        u2      u3        u4        u5    u6
u2    -
u3    b,c,d     b,c
u4    b         b,d     c,d
u5    a,b,c,d   a,b,c   -         a,b,c,d
u6    a,b,c,d   a,b,c   -         a,b,c,d   -
u7    -         -       a,b,c,d   a,b       c,d   c,d

The CORE is the feature b. We can see that b is the unique feature for discerning
u1 and u4. Furthermore, the two reducts are {b, c} and {b, d}. Since the feature a
is not contained in any reduct, it could be deleted.
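The matrix in this example can be recomputed with a short Python sketch. It is an illustration under the assumption that entries for pairs with the same decision value are left empty (shown as "-" in the matrix); the function names `discernibility_matrix` and `core_from_matrix` are hypothetical.

```python
from itertools import combinations

# Sample decision table (Table 1): condition features a, b, c, d; decision E.
TABLE = {
    "u1": {"a": 1, "b": 0, "c": 2, "d": 1, "E": 1},
    "u2": {"a": 1, "b": 0, "c": 2, "d": 0, "E": 1},
    "u3": {"a": 1, "b": 2, "c": 0, "d": 0, "E": 2},
    "u4": {"a": 1, "b": 2, "c": 2, "d": 1, "E": 0},
    "u5": {"a": 2, "b": 1, "c": 0, "d": 0, "E": 2},
    "u6": {"a": 2, "b": 1, "c": 1, "d": 0, "E": 2},
    "u7": {"a": 2, "b": 1, "c": 2, "d": 1, "E": 1},
}
C = ["a", "b", "c", "d"]


def discernibility_matrix(table, cond, dec):
    """Map each unordered pair of objects to the condition features that
    discern them; pairs with equal decision values get an empty set."""
    matrix = {}
    for x, y in combinations(table, 2):
        if table[x][dec] != table[y][dec]:
            matrix[(x, y)] = {a for a in cond if table[x][a] != table[y][a]}
        else:
            matrix[(x, y)] = set()
    return matrix


def core_from_matrix(matrix):
    """CORE is the set of features that occur as single-element entries."""
    return {next(iter(entry)) for entry in matrix.values() if len(entry) == 1}


matrix = discernibility_matrix(TABLE, C, "E")
core = core_from_matrix(matrix)
print(matrix[("u1", "u4")], core)
```

The only single-element entry is the one for the pair (u1, u4), so the computed CORE is {b}, as stated in the example.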


4  Feature  Subset  Selection 

An optimal feature subset selection based on the rough set theory can be viewed
as finding a reduct R, $R \subseteq C$, with the best classifying properties. R
will be used instead of C in a rule discovery algorithm.

Selecting an optimal reduct R from all subsets of features is not an easy
task. Considering the combinations among N features, the number of candidate
subsets is $2^N$. Hence, selecting the optimal reduct from all possible reducts
is NP-hard. For this reason, many methods for finding approximate results have
been proposed [1, 3, 11, 6, 5]. However, the features in CORE must be included
in both an optimal result and an approximate result. Obviously, none of the
indispensable features in CORE(C, D) can be deleted from C without changing
(lowering) the accuracy of the decision table. The feature(s) in CORE must
therefore be members of every feature subset. Note that not all of the features
in an optimal feature subset must be indispensable. Therefore, the problem of
feature subset selection becomes how to select features from the dispensable
features to form the best reduct together with CORE. We use CORE(C, D) as the
core of the feature subsets. If CORE is not a reduct of (C, D), some of the
dispensable features must be selected to make up a reduct.

Basically, the feature selection approaches can be divided into two types: the
filter approach and the wrapper approach [11].

4.1 The Filter Approach

The  filter  approach  selects  the  features  using  a  preprocessing  step.  The  main 
disadvantage  of  the  filter  approach  is  that  it  totally  ignores  the  effect  of  the 
selected  feature  subset  on  the  performance  of  the  induction  algorithm. 

The  main  feature  selection  strategies  of  the  filter  approach  are  as  follows: 

1. The minimal subset of features (the MINIMAL_FEATURES bias).

This bias has severe implications when applied blindly without regard for
the resulting induced concept. For example, the ID number of the patient in
medical diagnosis data may be picked as the only feature needed. Given
only the ID number, any induction algorithm is expected to generalize very
poorly.

2. Selecting the features with a higher rank.

The features are ranked according to some measure. A measure can be
based on any of accuracy, consistency, information, distance, and dependence.
However, this bias does not help with redundant features. Moreover,
it may not be wise to use this bias on data in which some irrelevant
feature is strongly correlated with the class feature.
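The danger of the first strategy can be made concrete with a small sketch. The patient records below are invented purely for illustration, and the helper names `is_consistent` and `minimal_consistent_subset` are hypothetical. The code looks for the smallest feature subset under which no two records with equal feature values have different diagnoses; because an ID number is unique per record, it trivially satisfies this test.

```python
from itertools import combinations

# Hypothetical medical records, made up for this illustration only.
ROWS = [
    {"id": 1, "fever": "yes", "cough": "no", "diagnosis": "flu"},
    {"id": 2, "fever": "yes", "cough": "yes", "diagnosis": "flu"},
    {"id": 3, "fever": "no", "cough": "yes", "diagnosis": "cold"},
    {"id": 4, "fever": "no", "cough": "no", "diagnosis": "healthy"},
]


def is_consistent(rows, features, decision):
    """True if rows that agree on `features` always agree on `decision`."""
    seen = {}
    for row in rows:
        key = tuple(row[f] for f in features)
        if seen.setdefault(key, row[decision]) != row[decision]:
            return False
    return True


def minimal_consistent_subset(rows, features, decision):
    """Smallest subset (first found, by increasing size) that stays consistent."""
    for size in range(1, len(features) + 1):
        for subset in combinations(features, size):
            if is_consistent(rows, subset, decision):
                return subset
    return None


with_id = minimal_consistent_subset(ROWS, ["id", "fever", "cough"], "diagnosis")
without_id = minimal_consistent_subset(ROWS, ["fever", "cough"], "diagnosis")
print(with_id, without_id)
```

With the ID column present, the minimal subset is the ID alone, which carries no information about unseen patients; once it is removed, the search returns the two clinically meaningful features.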

4.2  The  Wrapper  Approach 

In the wrapper approach, the feature subset selection is done using the induction
algorithm as a black box. A search for a good subset is done using the induction
algorithm itself as a part of the evaluation function.

The wrapper approach conducts a search in the space of possible subsets of
features, for example, the space of reducts. There are several search methods
that can be used for the wrapper approach:

- Exhaustive/Complete search

- Heuristic search

- Nondeterministic search

and so on.

When the number of features N is small, the search space may not be very large,
but it grows exponentially as N increases. In general, given a search space,
the more of it you search, the better the subset you can find. But since
resources are limited, we have to sacrifice some optimality of the selected
subsets. This sacrifice also has a limit: we must keep the optimality of a
feature subset as high as possible while spending as little search time as
possible.

Exhaustive/Complete search examines all possible subsets and finds the optimal
ones. Obviously, no optimal subset can be missed. The number of possible
subsets is $2^N$, so the time complexity of searching all of them is $O(2^N)$.
Using heuristics in search avoids brute-force search, but at the same time
risks losing optimal subsets. Heuristic search is much faster than exhaustive
search since it only searches a part of the subsets and finds a near-optimal
subset. Nondeterministic search is also called random search. It searches for
the next set at random, that is, the current set does not directly grow or
shrink from any previous set following a deterministic rule. It has two
characteristics: (1) we do not need to wait until the search ends; (2) we do
not know when the optimal set shows up, although we know that a better one has
appeared when it is there.
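To make the cost of exhaustive search tangible, the following Python sketch enumerates all $2^N$ subsets of the condition features of the sample database in Table 1 and keeps those that are reducts in the sense of Section 2 (same positive region as C, and every feature indispensable). It is only an illustration; the function names are assumptions, and for realistic N this enumeration is exactly the exponential search the text warns about.

```python
from collections import defaultdict
from itertools import combinations

# Sample decision table (Table 1): condition features a, b, c, d; decision E.
TABLE = {
    "u1": {"a": 1, "b": 0, "c": 2, "d": 1, "E": 1},
    "u2": {"a": 1, "b": 0, "c": 2, "d": 0, "E": 1},
    "u3": {"a": 1, "b": 2, "c": 0, "d": 0, "E": 2},
    "u4": {"a": 1, "b": 2, "c": 2, "d": 1, "E": 0},
    "u5": {"a": 2, "b": 1, "c": 0, "d": 0, "E": 2},
    "u6": {"a": 2, "b": 1, "c": 1, "d": 0, "E": 2},
    "u7": {"a": 2, "b": 1, "c": 2, "d": 1, "E": 1},
}
C = ["a", "b", "c", "d"]


def positive_region(table, cond, dec):
    """Objects whose class under `cond` has a single decision value."""
    classes = defaultdict(list)
    for name, row in table.items():
        classes[tuple(row[f] for f in cond)].append(name)
    pos = set()
    for members in classes.values():
        if len({table[m][dec] for m in members}) == 1:
            pos.update(members)
    return pos


def all_reducts(table, cond, dec):
    """Exhaustively test every subset of `cond` (2^N candidates)."""
    full = positive_region(table, cond, dec)
    found = []
    for size in range(len(cond) + 1):
        for subset in combinations(cond, size):
            if positive_region(table, subset, dec) != full:
                continue
            # Independent: dropping any single feature must change POS.
            if all(
                positive_region(table, [f for f in subset if f != c], dec) != full
                for c in subset
            ):
                found.append(set(subset))
    return found


reducts = all_reducts(TABLE, C, "E")
core = set.intersection(*reducts)
print(reducts, core)
```

For the sample table the search returns the two reducts {b, c} and {b, d}, and their intersection is the CORE {b}.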

5  Heuristics  for  Feature  Subset  Selection 

In this section, we describe our approach for feature subset selection. The data
we face are mostly very large and the number of features is quite high, so we
select heuristic search as our search strategy, because exhaustive/complete
search is too time consuming and with nondeterministic search it is difficult
to know when the optimal subset appears. Although heuristic search cannot
guarantee that the result is the best one, it is a better way of solving very
large, complex problems [8].

A search is invalid if it totally ignores the effect of the selected feature subset
on the performance of the induction algorithm. By using the induction algorithm
itself as a part of the evaluation function, as in the wrapper approach, a good
subset can no doubt be found. However, evaluating all subsets of features, or
even just a part of the feature subsets selected by some strategy, is also time
consuming.

For selecting feature subsets from a large database with a lot of features, we
select the best features one by one by using the evaluation criterion of an
induction algorithm, until a reduct is found. However, unlike the wrapper
approach, we do not select the best feature subset from all possible subsets of
features.

The evaluation criterion used in our feature selection approach is that of
the rule selection used in the rule discovery system GDT-RS, developed by us
[9, 10]:

1.  Selecting  the  rules  that  cover  as  many  instances  as  possible; 

2. Selecting the rules that contain as few features as possible, if they cover
the same number of instances;

3. Selecting the rules with larger strengths, if they are in the same
generalization (condition features) level and cover the same number of instances.

Here the strength of a generalization is related to the number of values of each
feature in the generalization. The more values, the stronger the generalization.

Whether a feature subset is good or not depends on the strengths of the rules
discovered by using this subset. The larger the strength, the better the subset.
To select the features that help to acquire rules with a larger cover rate
and a larger strength, the following selection strategies are used:

- To obtain a subset of features that is as small as possible, select the
features that make the number of consistent instances increase faster.

- To avoid considering first the features with a lot of attribute values, such
as the ID number or continuous attributes, not only are the preprocessing steps
of deleting unnecessary attributes and discretizing continuous attributes
important, but the features that can generate strong rules must also be
selected first. The size of the maximal subset in $POS_R(D)/IND(\{R, D\})$
should be considered since it affects the strengths of the rules. In general,
the more attribute values a feature has, the larger the number of subsets and
the smaller the size of the maximal subset. Selecting a feature by which a
bigger subset can be acquired serves our purpose.

Let the cardinality of the lower approximation of a set, card $POS_R(D)$, denote
the number of consistent instances, and let $max\_size(POS_R(D)/IND(\{R, D\}))$
denote the size of the maximal subset of the lower approximation $POS_R(D)$.
Feature selection can then be regarded as selecting the features that, when
added to the subset of features R, make card $POS_R(D)$ increase faster and
$max\_size(POS_R(D)/IND(\{R, D\}))$ bigger than adding other features would.

- When two features have the same performance as described above, the one
that has the smaller number of different values is selected. This guarantees
that the number of instances covered by a rule is as large as possible.

Based  on  the  preparations  stated  above,  a  heuristic  algorithm  is  described 
below.  At  first,  we  use  the  features  in  CORE  as  the  initial  feature  subset,  and 
then  choose  the  features  from  dispensable  features  one  by  one  by  using  the 
strategies  stated  above,  and  add  them  into  the  feature  subset,  until  a  reduct  is 
achieved. 

A Heuristic Algorithm

Let R be the subset of the selected features, P the set of unselected features,
U the set of all instances, X the set of contradictory instances, and
$EXPECT_k$ the threshold of the accuracy.

In the initial state, $R = CORE(C, D)$, $P = C - CORE(C, D)$,
$X = U - POS_R(D)$.

Step 1. Calculate the dependent degree,

$$k = \frac{card(U - X)}{card(U)},$$

where card denotes the cardinality of the set.

If $k \geq EXPECT_k$, then stop.

Step 2. For each p in P, calculate

$v = card\ POS_{R + \{p\}}(D)$,

$m = card\ max\_set(POS_{R + \{p\}}(D)/IND(R + \{p\}, D))$.

Step 3. Choose the best feature p with the largest $v * m$, and do

$R = R \cup \{p\}$,

$P = P - \{p\}$;

Step 4. Remove all of the consistent instances x that are contained in
the set $POS_R(D)$ from X.

Step 5. Go back to Step 1.
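The five steps can be sketched in Python as follows. This is an illustrative reading of the algorithm, not the authors' code. It makes two assumptions that the steps above do not spell out: v and m are computed on the remaining contradictory instances X (which is how the paper's worked example proceeds), and ties in v * m are broken in favour of the feature with fewer distinct values, following the third selection strategy. The names `consistent_classes` and `select_features` are hypothetical.

```python
from collections import defaultdict

# Sample decision table (Table 1): condition features a, b, c, d; decision E.
TABLE = {
    "u1": {"a": 1, "b": 0, "c": 2, "d": 1, "E": 1},
    "u2": {"a": 1, "b": 0, "c": 2, "d": 0, "E": 1},
    "u3": {"a": 1, "b": 2, "c": 0, "d": 0, "E": 2},
    "u4": {"a": 1, "b": 2, "c": 2, "d": 1, "E": 0},
    "u5": {"a": 2, "b": 1, "c": 0, "d": 0, "E": 2},
    "u6": {"a": 2, "b": 1, "c": 1, "d": 0, "E": 2},
    "u7": {"a": 2, "b": 1, "c": 2, "d": 1, "E": 1},
}


def consistent_classes(table, objects, cond, dec):
    """Classes of objects/IND(cond) whose members share one decision value."""
    classes = defaultdict(list)
    for name in objects:
        classes[tuple(table[name][f] for f in cond)].append(name)
    return [set(m) for m in classes.values()
            if len({table[x][dec] for x in m}) == 1]


def select_features(table, cond, dec, core, expect_k=1.0):
    R = list(core)                                 # selected features
    P = [f for f in cond if f not in core]         # unselected features
    U = set(table)
    X = U - set().union(*consistent_classes(table, U, R, dec))
    while True:
        k = len(U - X) / len(U)                    # Step 1: dependent degree
        if k >= expect_k or not P:
            return R, k
        best, best_key, best_pos = None, None, set()
        for p in P:                                # Step 2: score candidates
            classes = consistent_classes(table, X, R + [p], dec)
            v = sum(len(c) for c in classes)
            m = max((len(c) for c in classes), default=0)
            distinct = len({row[p] for row in table.values()})
            key = (v * m, -distinct)               # assumed tie-break rule
            if best_key is None or key > best_key:
                best, best_key = p, key
                best_pos = set().union(*classes)
        R.append(best)                             # Step 3
        P.remove(best)
        X -= best_pos                              # Step 4, then loop (Step 5)


selected, k = select_features(TABLE, ["a", "b", "c", "d"], "E", core=["b"])
print(selected, k)
```

Starting from CORE = {b}, the candidates score v * m = 0 for a, 5 for c, and 10 for d on the five remaining instances, so d is added and the loop stops with k = 1.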


Example 

We would like to use the sample database shown in Table 1 as an example to
explain how to get the feature subset using this algorithm. In Table 1, a, b, c,
and d are condition features, E is a decision feature, and
$U = \{u1, u2, u3, u4, u5, u6, u7\}$. {b} is the unique indispensable feature,
because deleting {b} causes an inconsistency: $\{a_1 c_2 d_1\} \to E_1$ and
$\{a_1 c_2 d_1\} \to E_0$.

From the following equivalence classes,

$U/\{b\} = \{\{u1, u2\}, \{u5, u6, u7\}, \{u3, u4\}\}$

$U/\{E\} = \{\{u4\}, \{u1, u2, u7\}, \{u3, u5, u6\}\}$,

we know that the b-positive region of E, $POS_{\{b\}}(E)$, is {u1, u2}. Hence, in
the initial state, R = {b}, P = {a, c, d}, and X = {u3, u4, u5, u6, u7}. The
initial state is shown as follows:


U     b   E
u3    2   2
u4    2   0
u5    1   2
u6    1   2
u7    1   1

Let $EXPECT_k = 1$; the termination condition will then be $k \geq 1$.

Since k = 2/7 < 1, R is not a reduct, and we must continue to select. The next
candidate is a, c or d. Table 2 gives the results of adding {a}, {c}, or {d} into
R, respectively.

Table 2. Selecting the second feature from P = {a, c, d}.

From Table 2 or the following equivalence classes (computed on the remaining
instances X),

$U/E = \{\{u4\}, \{u7\}, \{u3, u5, u6\}\}$;

$U/\{a, b\} = \{\{u3, u4\}, \{u5, u6, u7\}\}$;

$U/\{b, c\} = \{\{u5\}, \{u6\}, \{u7\}, \{u3\}, \{u4\}\}$;

$U/\{b, d\} = \{\{u5, u6\}, \{u7\}, \{u3\}, \{u4\}\}$;

$POS_{\{a, b\}}(E) = \emptyset$;

$POS_{\{b, c\}}(E) = POS_{\{b, d\}}(E) = \{u3, u4, u5, u6, u7\}$;

$max\_size(POS_{\{b, c\}}(E)/\{b, c, E\}) = 1$;

$max\_size(POS_{\{b, d\}}(E)/\{b, d, E\}) = |\{u5, u6\}| = 2$,

we can see that selecting the feature a cannot reduce the number of contradictory
instances, but if either c or d is selected, all instances become consistent.
Since the maximal set is in U/{b, d, E}, d should be selected first.

After adding d into R, all instances are consistent and must be removed
from X. Hence X is empty, k = 1, and the process is finished. The selected
feature subset is {b, d}.

6  Experiment  Results 

Using the algorithm stated in Section 5, we have tested several databases. Some
of them are artificial: Monk1, Monk3; some of them are well made: Mushroom,
breast cancer, earthquake; and some of them are real world databases:
meningitis, medical treatment, land-slide. Table 3 shows the results of feature
selection on these datasets. In Table 3, #attr_n, #inst_n, #CORE, and
#attr_n(sel) denote the number of features in a dataset, the number of
instances, the number of features in CORE, and the number of features selected,
respectively.


Table 3. Results of feature selection

7 Conclusions

In this paper, we presented an approach for feature selection. It is based on
rough set theory and greedy heuristics. The main advantages of our approach
are that it can select a better subset of features quickly and effectively from
a large database with a lot of features, and that the selected features do not
damage the performance of induction much, since the performance of induction is
considered in the evaluation criterion for feature selection.

Abstract. Most of the existing discretization approaches discretize each
continuous attribute independently, without considering the discretization
results of other continuous attributes. Therefore, some unreasonable and
superfluous discretization split points are usually created. Based on a
compatibility rough set model and a genetic algorithm, a global discretization
approach is provided. The experimental results indicate that the proposed
global discretization approach can significantly decrease the number of
discretization split points and the number of rules while increasing the
predictive accuracy of the classifier.

1  Introduction 

In practical applications of machine learning and data mining, there are many
continuous attributes to which some symbolic inductive learning algorithms
cannot be applied unless the continuous attributes are first discretized. There
are currently several discretization algorithms [2, 5, 9, 14, 15]. Most are
independent methods, meaning that they discretize each continuous attribute
independently, without considering the discretization results of other
continuous attributes.

The final discretization result of a set of continuous attributes mainly depends
on the locations and the number of the selected discretization split points
that come from different continuous attributes. Usually, the discretization
approaches that consider the dependency among multiple continuous attributes
are called global discretization. By means of the binarization of continuous
attributes [8], the problem of the global discretization of continuous
attributes was converted into the problem of selecting the simplest subset of
binary attributes [9]. However, the problem with the global discretization
approach is that the number of the initial binary attributes is extremely
large, because all potential split points are taken into account.

Based on the new compatibility rough set model introduced in Section 2, we
propose a novel approach that can generate a reasonably sized initial set of
split points. A genetic algorithm is also adopted in order to obtain optimal
discretization results.

2  Compatibility  Rough  Set 

The standard rough set model introduced by Pawlak [10] is based on an equivalence
relation on the instances. Many authors have proposed interesting extensions of
the initial rough set model [6, 7, 13, 16]. A common feature of these extended
models is the induction of a non-equivalence relation on the instances based on
a non-equivalence relation on the attribute values. In this study, we intend to
introduce a compatibility rough set model based on a compatibility relation
defined directly on the instances instead of on the attribute values.




Let B be a subset of the full set of instances, $B \subseteq U$, and let
$C = \{c_1, \cdots, c_n\}$ be the set of conditional attributes (assume C
contains both continuous and symbolic attributes). Each instance corresponds to
an attribute value vector. We can construct a hyper-region in feature space
based on a group of instances by using the instance merging operation
$u_1 \oplus \cdots \oplus u_l$, where $\oplus$ denotes the merging operation.
Actually, the instance merging operation is realized by the attribute value
merging operation
$u_1 \oplus \cdots \oplus u_l = \{\{v_{1,1} \oplus \cdots \oplus v_{l,1}\}, \cdots, \{v_{1,n} \oplus \cdots \oplus v_{l,n}\}\}$.
For a continuous attribute,

$v_{1,i} \oplus \cdots \oplus v_{l,i} = [\min(v_{1,i}, \cdots, v_{l,i}), \max(v_{1,i}, \cdots, v_{l,i})]$,
and for a symbolic attribute,

$v_{1,i} \oplus \cdots \oplus v_{l,i}$

If the instances in a hyper-region have an identical decision label, then the
hyper-region is called a pure hyper-region. Any pair of instances $u_i, u_j$ in a
pure hyper-region is called compatible. The compatibility of $u_i, u_j$ defines a
type of binary relation. Clearly, this binary relation is reflexive and
symmetric, therefore the compatibility on the instances is a compatibility
relation.

Based on the compatibility relation, we can only get some compatibility classes
on the instances. The union of all the compatibility classes covers the full set
of instances, but usually not every compatibility class is required to cover the
set. A set of compatibility classes is called the simplest closed cover if it
completely covers the full set of instances and contains the fewest
compatibility classes. Because each compatibility class can only cover a
proportion of the full set of instances, the problem of searching for the
simplest closed cover is equivalent to the minimal set cover problem, which is
NP-hard. Next, we give a greedy algorithm that is able to search for an
approximate solution of the simplest closed cover.

Let $U/IND(\{d\}) = \{D_1, \cdots, D_r\}$ be the set of decision equivalence
classes of a decision table $A = (U, C \cup \{d\})$, CP a compatibility class,
CPS the set of compatibility classes, and TM a temporary set.

Step 1. CPS = {};

Step 2. for i = 1 to r do { TM = $D_i$;

    While (Card(TM) ≠ 0)

    { CP ∈ TM; TM = TM \ CP;
      for j = 1 to Card(TM) do

      { if CP is compatible with $u_j \in D_i$ against ($D_i$, CPS)

        then { CP = CP ⊕ $u_j$; TM = TM \ $u_j$; }

      }

      CPS = CPS ∪ CP;

    } }
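The greedy procedure can be sketched in Python under simplifying assumptions that are not taken from the paper: all attributes are continuous, a hyper-region is the axis-aligned bounding box of the merged instances, and a region counts as pure when no instance with a different decision label falls inside that box. The function names `bounding_box`, `inside`, and `greedy_cover` are hypothetical.

```python
def bounding_box(points):
    """Merge instances into a hyper-region: one [min, max] interval per attribute."""
    return [(min(col), max(col)) for col in zip(*points)]


def inside(point, box):
    """True if the point lies within every interval of the box."""
    return all(lo <= v <= hi for v, (lo, hi) in zip(point, box))


def greedy_cover(X, y):
    """Greedily group same-label instances into pure hyper-regions.

    Returns a list of compatibility classes, each a list of instance indices.
    """
    cover = []
    for label in sorted(set(y)):
        remaining = [i for i, lab in enumerate(y) if lab == label]
        others = [X[i] for i, lab in enumerate(y) if lab != label]
        while remaining:
            group = [remaining.pop(0)]            # seed a new class
            for j in list(remaining):
                box = bounding_box([X[i] for i in group + [j]])
                # Absorb u_j only if the enlarged region stays pure.
                if not any(inside(p, box) for p in others):
                    group.append(j)
                    remaining.remove(j)
            cover.append(group)
    return cover


# One continuous attribute; the "B" instance at 3.0 splits the "A" instances.
X = [(1.0,), (2.0,), (3.0,), (4.0,), (5.0,)]
y = ["A", "A", "B", "A", "A"]
cover = greedy_cover(X, y)
print(cover)
```

In this toy data the instances labelled "A" cannot all be merged because the resulting interval would contain the "B" instance, so the sketch produces two classes for "A" and one for "B".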

3  Obtaining  the  Initial  Set  of  Split  Points 

The major objective of introducing this compatibility rough set model is to
generate a moderately sized initial set of split points, from which the
simplest set of discretization split points will be searched. The value domain
of each attribute is divided into some overlapping value intervals or value
sets. For a continuous attribute, each value interval has two boundaries. The
boundaries with the smaller value and the larger value are called the low
boundary and the high boundary, respectively. The initial split points can be
determined based on the relationship among these boundaries of the value
intervals. The actual procedure is described as follows:

Let $LBS = \{lb_1, \cdots, lb_q\}$ and $UBS = \{ub_1, \cdots, ub_q\}$ be the two
sets of low boundaries and up boundaries of a continuous attribute $c_i \in C$.
SPS is the set of split points, and TM is a temporary set.

Step 1. SPS = {}; TM = {}; N = 0;

Step 2. for j = 1 to q do

    { if (there is a $lb_{min} = \min\{lb_i : ub_j < lb_i \in LBS\}$)
      then look for a $ub_{max} = \max\{ub_i : lb_{min} > ub_i \in UBS\}$;

      TM = TM ∪ {($ub_{max}$, $lb_{min}$)}; }

Step 3. for j = 1 to N do

    { sp = ($lb'_j$ + $ub'_j$)/2;  /* ($lb'_j$, $ub'_j$) ∈ TM */

      SPS = SPS ∪ {sp}; }

Step 4. Repeat Step 2 and Step 3 for each $c_i \in C$.
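A compact Python sketch of this procedure for a single continuous attribute is given below. It follows one reading of the steps: for each up boundary, find the nearest low boundary above it, then the largest up boundary below that low boundary, and place a split point at the midpoint of the resulting gap. The sketch iterates over the collected gaps directly instead of using the counter N, and the function name `split_points` is an assumption.

```python
def split_points(intervals):
    """Initial split points for one continuous attribute.

    `intervals` is a list of (low_boundary, up_boundary) pairs, one per
    value interval of the attribute.
    """
    lows = [lb for lb, _ in intervals]
    ups = [ub for _, ub in intervals]
    gaps = set()                                  # plays the role of TM
    for ub_j in ups:
        higher = [lb for lb in lows if lb > ub_j]
        if not higher:
            continue                              # no interval starts above ub_j
        lb_min = min(higher)
        # Non-empty because ub_j itself is below lb_min.
        ub_max = max(ub for ub in ups if ub < lb_min)
        gaps.add((ub_max, lb_min))
    # One split point in the middle of every gap between value intervals.
    return sorted((ub + lb) / 2 for ub, lb in gaps)


# Two overlapping intervals, then two separate ones.
points = split_points([(1, 3), (2, 4), (6, 8), (9, 10)])
print(points)
```

Overlapping intervals such as (1, 3) and (2, 4) do not produce a split point between them; only the gaps (4, 6) and (8, 9) do.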

4  Selecting  the  Optimal  Set  of  Split  Points 

The three main issues in applying a genetic algorithm to any optimization
problem are the choice of an appropriate representation scheme, a fitness
function, and the initialization of the chromosome population. The natural
representation for optimal feature subset selection (OFSS) is a bit string of
length N representing the presence or absence of the N possible binary
attributes.

The problem of OFSS is usually regarded as a state space search problem.
Obviously, the closer the starting search states are to the final optimal
state, the higher the search efficiency is. H. S. Nguyen [9] provided a greedy
quick discretization method by which we can greedily construct some relative
reducts of the binary attributes (the population of the initial chromosomes).

Shan [14] gave an entropy function that can measure the discretization
complexity:

$$H_C(D) = -\sum_{j=1}^{m} p(D_j) \sum_{i=1}^{n} p(C_i/D_j) \log p(C_i/D_j)$$

where $p(D_j)$ is the probability that an instance belongs to decision
equivalence class $D_j$, and $p(C_i/D_j)$ is the probability that an instance
belonging to decision equivalence class $D_j$ is matched by conditional
equivalence class $C_i$.

In order to measure the simplification (the number of the binary attributes) of a
relative reduct while the discretization complexity is measured, we slightly
modify the above entropy function to obtain the fitness function:

$$H(C'_B) = Card(C'_B) - \sum_{j=1}^{m} p(D_j) \sum_{i=1}^{n} p(C_i/D_j) \log p(C_i/D_j)$$

where $Card(C'_B)$ is the number of the binary attributes in the current
chromosome.
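The entropy term and the fitness function can be illustrated with the following Python sketch. It assumes a base-2 logarithm (the base is not stated in the text), represents a chromosome as a list of 0/1 bits over the binary attributes, and takes the conditional equivalence classes to be the groups of instances that agree on the selected binary attributes. The names `conditional_entropy` and `fitness` and the toy data are hypothetical.

```python
from collections import Counter
from math import log2


def conditional_entropy(cond_classes, decisions):
    """-sum_j p(D_j) sum_i p(C_i/D_j) log p(C_i/D_j), with log base 2 assumed."""
    n = len(decisions)
    per_decision = Counter(decisions)
    joint = Counter(zip(decisions, cond_classes))
    h = 0.0
    for (d, _c), count in joint.items():
        p_d = per_decision[d] / n
        p_c_given_d = count / per_decision[d]
        h -= p_d * p_c_given_d * log2(p_c_given_d)
    return h


def fitness(chromosome, binary_rows, decisions):
    """Number of selected binary attributes plus the entropy term."""
    selected = [i for i, bit in enumerate(chromosome) if bit]
    # Instances with equal values on the selected attributes share a class.
    cond_classes = [tuple(row[i] for i in selected) for row in binary_rows]
    return len(selected) + conditional_entropy(cond_classes, decisions)


# Toy data: four instances, two binary attributes, two decision classes.
binary_rows = [(0, 0), (0, 1), (1, 1), (1, 1)]
decisions = [0, 0, 1, 1]
both = fitness([1, 1], binary_rows, decisions)
first_only = fitness([1, 0], binary_rows, decisions)
print(both, first_only)
```

In the toy data the first binary attribute alone already separates the two decision classes, so the chromosome [1, 0] has both fewer attributes and a zero entropy term.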

After certain specific genetic operations, such as cross-over or mutation, the
offspring of two relative reducts may not be a new relative reduct. However,
being a relative reduct is the essential condition of the optimal subset of the
binary attributes. In order to efficiently detect whether a new offspring is a
relative reduct, we take the simplified discernibility factor set (SDFS)
$\Phi_C$ modulo decision information, which was proposed by T. Mollestad [11],
as a filter. Let $C_B$ be the full set of the binary attributes and $C'_B$ a
subset of $C_B$. If no $\varphi \in \Phi_C$ satisfies
$\varphi \subseteq C_B - C'_B$, then $C'_B$ is a relative reduct; otherwise,
$C'_B$ is not a relative reduct. Usually $Card(\Phi_C) \ll Card(U^2)$, so the
efficiency of the above set containment query is quite high. If only the
relative reducts are forwarded to the fitness function, then we can omit a lot
of unnecessary fitness function calculations.

5  Experiments  and  Conclusions 

In order to test the effectiveness of the discretization approach based on the
compatibility rough set model, an experiment was conducted. We selected nine
data sets from the UCI repository [12] that are suitable for evaluating
discretization methods. The general-purpose genetic algorithm program GENESIS
[3] was used as the search engine. Parameters for the GA were set using the
default values given in GENESIS. In the experiment, each data set is divided
into two groups, sixty percent as the training set and forty percent as the
testing set. The AE5 rule induction algorithm [4], which is based on the theory
of extension matrix, is used to evaluate the predictive accuracy. Table 4 shows
the experimental results for the nine selected data sets.

Table 4. Comparison of the number of split points (N. of SP), the number of
rules (N. of R), and the predictive accuracy of the classifier between the
discretization of entropy and the discretization of compatibility rough set.

            Discretization of Entropy       Discretization of Comp. Rough Set
Data Set    N. of SP  N. of R  Accuracy     N. of SP  N. of R  Accuracy
Breast      12        58       81.2%        7         42       92.2%
Diabetes    49        44       64.8%        19        39       69.4%
Echo        23        63       32.5%        8         43       70.0%
Glass       229       93       48.4%        14        20       57.8%
Heart       41        60       63.2%        12        16       73.8%
Hepatiti    57        27       74.2%        13        23       74.2%
Iris        27        19       90.1%        6         8        96.1%
Thyroid     44        78       62.8%        8         39       98.9%
Wine        137       123      47.2%        6         19       92.5%

Comparison of the number of discretization split points, the number of rules,
and the predictive accuracy shows that the discretization method based on the
compatibility rough set model significantly decreases the number of
discretization split points and the number of rules on all nine data sets. It
improves the predictive accuracy on eight of them and matches it on the
remaining one (Hepatiti).

Abstract. A necessity measure N is defined by an implication function.
However, specification of an implication function is difficult. Necessity
measures are closely related to inclusion relations. In this paper,
we propose an approach to necessity measure specification by giving a
parametric inclusion relation between fuzzy sets A and B that is equivalent to
$N_A(B) \geq h$. It is shown that, in this way, we can specify a necessity
measure, i.e., an implication function. Moreover, given an implication
function, an associated inclusion relation is discussed.


1  Introduction 

Possibility  theory  [2]  [8]  has  been  applied  to  many  fields  such  as  approximate  rea¬ 
soning,  data  base  theory,  decision  making,  optimization  and  so  forth.  In  possibil¬ 
ity  theory,  possibility  and  necessity  measures  play  key  roles  to  handle  uncertain 
information,  ambiguous  knowledge  and  vague  concepts.  There  exist  quite  a  lot 
of  possibility  and  necessity  measures  and  the  selection  of  those  measures  quali¬ 
fies  the  properties  of  fuzzy  reasoning,  decision  principles  and  so  on.  Possibility 
and  necessity  measures  should  reflect  the  expert’s  knowledge  and/or  decision 
maker’s  preference.  Between  possibility  and  necessity  measures,  the  selection  of 
a  necessity  measure  is  much  more  important  since  (1)  it  directly  qualifies  fuzzy 
rules  and  possibility  distributions  in  approximate  reasoning  (see  [1])  and  (2)  it 
is  used  for  the  measure  of  safety  or  robustness. 

A necessity measure N under a possibility distribution $\mu_A$ (i.e., fuzzy
information that x is in A) is defined as

$$N_A(B) = \inf_{x \in X} I(\mu_A(x), \mu_B(x)) \qquad (1)$$

by an implication function $I : [0, 1] \times [0, 1] \to [0, 1]$ such that
$I(0, 0) = I(0, 1) = I(1, 1) = 1$ and $I(1, 0) = 0$, where $\mu_A$ and $\mu_B$
are membership functions of fuzzy sets A and B of a universal set X. The
selection of a necessity measure means that of an implication function. In real
world situations, it is not easy for us to select an implication function
directly since we are not aware of what kind of implication function is used to
evaluate the certainty degree of the conclusion

the implication function itself is far from our imagination. On the other hand,
the necessity measure is closely related to the inclusion relation between A and
B, as is defined in the crisp case. Indeed, for several implication functions,
equivalent conditions of $N_A(B) \geq h$ are known to be inclusion relations
between fuzzy sets A and B with a parameter h. An inclusion relation is much
easier for us to imagine than an implication function. From this point of view,
we will be able to specify an inclusion relation with a parameter h as an
equivalent condition to $N_A(B) \geq h$. Such a parametric inclusion relation
specification is at least easier than the implication function specification. In
this way, the authors [5] succeeded in constructing a necessity measure, in
other words, an implication function, from a given inclusion relation with
respect to $N_A(B) \geq h$ and proposed nine kinds of necessity measures defined
by distinct inclusion relations with h. In this paper, an inclusion relation
with a parameter h with respect to $N_A(B) \geq h$ is called a
‘level cut condition’ and the approach to specify a necessity measure by giving
a level cut condition is called a ‘level cut conditioning approach’.

Since the level cut conditioning approach has not yet been studied extensively,
there still remain open problems: (Q1) Can we unify the level cut conditions
without loss of rationality?, (Q2) Is there a level cut condition that the necessity
measure associated with an arbitrarily given implication function satisfies?,
(Q3) Can any novel necessity measure be derived by this approach?, (Q4) How
can the results of this approach be utilized in real world problems? and so on.
In this paper, we answer the questions (Q1) and (Q2). To (Q1), we give a generalized
level cut condition and show the existence of the necessity measure that satisfies the
condition. To (Q2), we show that a level cut condition can be obtained when
a given implication function satisfies certain properties. On account of limited
space, (Q3) and (Q4) are not answered in this paper but in our future papers.

2  Necessity  Measures  defined  by  Level  Cut  Conditions 

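When A and B are crisp sets, the necessity measure N is uniquely defined by

$$N_A(B) = \begin{cases} 1, & \text{if } A \subseteq B, \\ 0, & \text{otherwise.} \end{cases} \tag{2}$$
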
The traditional and most well-known necessity measure is the one defined by
(1) with Dienes implication function $I^D(a,b) = \max(1 - a, b)$. For this necessity
measure, we have (see [5])

$$N_A^D(B) \ge h \iff (A)_{1-h} \subseteq [B]_h, \tag{3}$$

where $(A)_h$ and $[A]_h$ are strong and weak h-level sets of A defined by

$$(A)_h = \{x \mid \mu_A(x) > h\}, \qquad [A]_h = \{x \mid \mu_A(x) \ge h\}. \tag{4}$$

The necessity measures $N^G$ and $N^{r-G}$ defined by (1) with Gödel implication
function $I^G$ and reciprocal Gödel implication function $I^{r-G}$ satisfy (see [5])

$$N_A^G(B) \ge h \iff (A)_k \subseteq (B)_k, \quad \forall k < h, \tag{5}$$

$$N_A^{r-G}(B) \ge h \iff [A]_{1-k} \subseteq [B]_{1-k}, \quad \forall k < h, \tag{6}$$

where $I^G$ and $I^{r-G}$ are defined by

$$I^G(a,b) = \begin{cases} 1, & \text{if } a \le b, \\ b, & \text{if } a > b, \end{cases} \qquad
I^{r-G}(a,b) = \begin{cases} 1, & \text{if } a \le b, \\ 1 - a, & \text{if } a > b. \end{cases} \tag{7}$$

As shown in (3)-(6), necessity measures are closely related to set inclusion
relations. Moreover, those necessity measures are uniquely specified by the
right-hand side conditions of (3)-(6) since we have

$$N_A(B) = \sup_{h} \{h \mid N_A(B) \ge h\}. \tag{8}$$

From this fact, it is conceivable to specify a necessity measure by giving a
necessary and sufficient condition of $N_A(B) \ge h$. From the practical point of
view, giving such a condition must be easier than giving an implication function
directly to define a necessity measure, since inclusion relations are easier to
picture than implication functions. From this point of view, the
authors [5] proposed the level cut conditioning approach to define a necessity
measure. We succeeded in constructing nine kinds of necessity measures by giving
nine different level cut conditions. In this paper, generalizing our previous results, we
discuss measures which satisfy the following condition:

$$N_A^L(B) \ge h \iff m_h(A) \subseteq M_h(B), \tag{9}$$


where $m_h(A)$ and $M_h(A)$ are fuzzy sets obtained from a fuzzy set A by applying
a suitable parametric transformation as will formally be defined later. We assume
that the inclusion relation between fuzzy sets is defined normally, i.e.,

$$A \subseteq B \iff \mu_A(x) \le \mu_B(x), \quad \forall x \in X. \tag{10}$$


Let a fuzzy set A have a linguistic label $\alpha$. Then, roughly speaking, $m_h(A)$
is a fuzzy set corresponding to a linguistic label, "very $\alpha$", "extremely $\alpha$" or
"typically $\alpha$" and $M_h(A)$ a fuzzy set corresponding to a linguistic label, "roughly
$\alpha$", "more or less $\alpha$" or "weakly $\alpha$". Thus, (9) tries to capture that an event 'x is
$\beta$' expressed by a fuzzy set B is necessary to a certain extent under information
that 'x is $\alpha$' expressed by a fuzzy set A if and only if the fact 'x is very $\alpha$' entails
the fact 'x is roughly $\beta$'. Degrees of stress and relaxation by modifiers 'very' and
'roughly' decrease as the necessity degree h increases, i.e., $m_h$ and $M_h$ satisfy

$$h_1 \ge h_2 \implies m_{h_1}(A) \supseteq m_{h_2}(A), \quad M_{h_1}(A) \subseteq M_{h_2}(A). \tag{11}$$

Now let us define $m_h$ and $M_h$ mathematically. $m_h(A)$ and $M_h(A)$ are defined
by the following membership functions:

$$\mu_{m_h(A)}(x) = g^m(\mu_A(x), h), \qquad \mu_{M_h(A)}(x) = g^M(\mu_A(x), h), \tag{12}$$

where functions $g^m : [0,1] \times [0,1] \to [0,1]$ and $g^M : [0,1] \times [0,1] \to [0,1]$ are
assumed to satisfy

(g1) $g^m(a, \cdot)$ is lower semi-continuous and $g^M(a, \cdot)$ upper semi-continuous
for all $a \in [0,1]$,

(g2) $g^m(1, h) = g^M(1, h) = 1$ and $g^m(0, h) = g^M(0, h) = 0$ for all $h > 0$,

(g3) $g^m(a, 0) = 0$ and $g^M(a, 0) = 1$ for all $a \in [0,1]$,

(g4) $h_1 \ge h_2$ implies $g^m(a, h_1) \ge g^m(a, h_2)$ and $g^M(a, h_1) \le g^M(a, h_2)$ for
all $a \in [0,1]$,

(g5) $a \ge b$ implies $g^m(a, h) \ge g^m(b, h)$ and $g^M(a, h) \ge g^M(b, h)$ for all
$h \le 1$,

(g6) $g^m(a, 1) > 0$ and $g^M(a, 1) < 1$ for all $a \in (0,1)$.

(g1) is required in order to guarantee the existence of a measure that satisfies (9) (see
Theorem 1). (g2) means that complete members of a fuzzy set A are also complete
members of the fuzzy sets $m_h(A)$ and $M_h(A)$ and complete non-members
of A are also complete non-members of the fuzzy sets $m_h(A)$ and $M_h(A)$. This
implies that $[m_h(A)]_1 = [M_h(A)]_1 = [A]_1$ and $(m_h(A))_0 = (M_h(A))_0 = (A)_0$ for
any $h > 0$. (g3) is required so that the condition on the right-hand side of (9) holds
when $h = 0$, as the left-hand side $N_A^L(B) \ge 0$ always does. (g4) coincides with the
requirement (11). (g5) means that the membership
degrees of $m_h(A)$ and $M_h(A)$ increase as that of A increases. (g6) means
that all possible members of A cannot be complete non-members of $m_h(A)$ at
the lowest stress level, i.e., $h = 1$, and that all possible non-members of A cannot
be complete members of $M_h(A)$ at the lowest relaxation level, i.e., $h = 1$. As
described above, those requirements, (g1)-(g6), can be considered as natural.

It should be noted that $g^m$ and $g^M$ can be defined by

$$g^m(a,h) = T(a,h), \qquad g^M(a,h) = \mathrm{Cnv}[I](a,h) = I(h,a), \tag{13}$$

where T and I are a conjunction function and an implication function that satisfy

(I1) $I(a, 1) = 1$ and $I(0, a) = 1$, for all $a \in [0,1]$,

(I2) $I(a, 0) = 0$, for all $a \in (0,1]$,

(I3) $I(\cdot, a)$ is upper semi-continuous for all $a \in [0,1]$,

(I4) $a \le c$ and $b \ge d$ imply $I(a, b) \ge I(c, d)$,

(I5) $I(1, a) < 1$, for all $a \in [0,1)$,

(T1) $T(0, a) = 0$ and $T(a, 0) = 0$, for all $a \in [0,1]$,

(T2) $T(1, a) = 1$, for all $a \in (0,1]$,

(T3) $T(a, \cdot)$ is lower semi-continuous for all $a \in [0,1]$,

(T4) $a \ge c$ and $b \ge d$ imply $T(a, b) \ge T(c, d)$,

(T5) $T(a, 1) > 0$, for all $a \in (0,1]$.

Conversely, $I(a,b) = g^M(b,a)$ is an implication function that satisfies (I1)-(I5) and
$T(a,b) = g^m(a,b)$ is a conjunction function that satisfies (T1)-(T5).

Remark 1. In order to express fuzzy sets corresponding to linguistic labels 'very
$\alpha$' and 'roughly $\alpha$', $m_h(A)$ and $M_h(A)$ should satisfy $m_h(A) \subseteq A \subseteq M_h(A)$,
$\forall h \in [0,1]$. In [5], we required so. However, to generalize the results obtained in
[5], we dropped this requirement. By this generalization, we can treat conditions
including h-level sets. For example, (3) can be expressed by (9) with definitions,

$$g^m(a,h) = \begin{cases} 1, & \text{if } a > 1 - h, \\ 0, & \text{otherwise,} \end{cases} \qquad
g^M(a,h) = \begin{cases} 1, & \text{if } a \ge h, \\ 0, & \text{otherwise.} \end{cases}$$
The following theorem guarantees the existence and uniqueness of $N_A^L$.

Theorem 1. $N_A^L$ exists and is defined by

$$N_A^L(B) = \sup_{0 \le h \le 1} \{h \mid m_h(A) \subseteq M_h(B)\}. \tag{14}$$

Proof. Suppose $m_h(A) \not\subseteq M_h(B)$ when $\sup_{0 \le k \le 1} \{k \mid m_k(A) \subseteq M_k(B)\} \ge h$.
From (11) and (12), there exists $x \in X$ such that $g^m(\mu_A(x), k) = \mu_{m_k(A)}(x) \le
\mu_{M_k(B)}(x) = g^M(\mu_B(x), k)$, $\forall k < h$, but $g^m(\mu_A(x), h) = \mu_{m_h(A)}(x) > \mu_{M_h(B)}(x)
= g^M(\mu_B(x), h)$. This fact implies

$$\liminf_{k \to h} g^m(\mu_A(x), k) < g^m(\mu_A(x), h) \quad \text{or} \quad \limsup_{k \to h} g^M(\mu_B(x), k) > g^M(\mu_B(x), h).$$

This contradicts the lower semi-continuity of $g^m(a, \cdot)$, $\forall a \in [0,1]$, and the upper
semi-continuity of $g^M(a, \cdot)$, $\forall a \in [0,1]$. Therefore, we have

$$\sup_{0 \le k \le 1} \{k \mid m_k(A) \subseteq M_k(B)\} \ge h \implies m_h(A) \subseteq M_h(B).$$

The converse is obvious. Hence, we have (14). □

The following theorem shows that $N_A^L$ defined by (14) is a necessity measure.

Theorem 2. $N_A^L$ is a necessity measure and the associated implication function
$I^L$ is defined by

$$I^L(a, b) = \sup_{0 \le h \le 1} \{h \mid g^m(a, h) \le g^M(b, h)\}. \tag{15}$$

Proof. Let us consider $I^L$ defined by (15). From (g2) and (g3), we obtain $I^L(0,0)
= I^L(0,1) = I^L(1,1) = 1$ and $I^L(1,0) = 0$. Thus, $I^L$ is an implication function.
From (g1), we have

$$I^L(a,b) \ge h \iff g^m(a,h) \le g^M(b,h). \tag{*}$$

Consider a measure $\Phi_A(B) = \inf_{x \in X} I^L(\mu_A(x), \mu_B(x))$. By (*), it is easy to show

$$\Phi_A(B) \ge h \iff m_h(A) \subseteq M_h(B).$$

Thus, $\Phi_A(B) \le N_A^L(B)$. Suppose $\Phi_A(B) < N_A^L(B) = h^*$. Then there exists an
$x^* \in X$ such that $I^L(\mu_A(x^*), \mu_B(x^*)) < h^*$. From (*), we have $g^m(\mu_A(x^*), h^*) >
g^M(\mu_B(x^*), h^*)$. From (12), $m_{h^*}(A) \not\subseteq M_{h^*}(B)$. From (9), we obtain $N_A^L(B) <
h^*$. However, this contradicts $N_A^L(B) \ge h^*$. Hence, $N_A^L(B) = \Phi_A(B)$. □
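Theorems 1 and 2 can be illustrated numerically. The sketch below is an assumption-laden illustration, not the authors' procedure: it takes the level-set functions of Remark 1, i.e. $g^m(a,h) = 1$ if $a > 1 - h$ (0 otherwise) and $g^M(a,h) = 1$ if $a \ge h$ (0 otherwise), restricts levels and grades to the grid $\{0, 1/20, \ldots, 1\}$ so that suprema become maxima, and uses hypothetical membership grades.

```python
from fractions import Fraction as F

GRID = [F(i, 20) for i in range(21)]  # admissible levels h and grades

def g_m(a, h):
    # m_h(A) is the strong (1 - h)-level set of A.
    return F(1) if a > 1 - h else F(0)

def g_M(a, h):
    # M_h(A) is the weak h-level set of A.
    return F(1) if a >= h else F(0)

def implication_L(a, b):
    """Eq. (15), with the sup taken over the finite grid of levels."""
    return max(h for h in GRID if g_m(a, h) <= g_M(b, h))

def necessity_by_cuts(mu_a, mu_b):
    """Eq. (14): largest grid level h with m_h(A) included in M_h(B)."""
    return max(h for h in GRID
               if all(g_m(mu_a[x], h) <= g_M(mu_b[x], h) for x in mu_a))

def necessity_by_implication(mu_a, mu_b):
    """Eq. (1) with the implication function I^L."""
    return min(implication_L(mu_a[x], mu_b[x]) for x in mu_a)

# Hypothetical membership grades, all lying on the grid.
mu_A = {"x1": F(0), "x2": F(3, 10), "x3": F(8, 10), "x4": F(1)}
mu_B = {"x1": F(2, 10), "x2": F(1, 10), "x3": F(6, 10), "x4": F(9, 10)}

print(necessity_by_cuts(mu_A, mu_B), necessity_by_implication(mu_A, mu_B))
```

On the grid, `implication_L` coincides with the Dienes implication $\max(1 - a, b)$, and the two ways of computing the necessity degree agree, as Theorem 2 asserts.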

It is worth knowing the properties of $I^L$ in order to see the range of implication
functions defined by level cut conditions.

Proposition 1. $I^L$ defined by (15) satisfies (I1), (I4), (I5) and

(I6) $I(a, 0) < 1$, for all $a \in (0,1]$.

Moreover, $I^L$ satisfies (I2) if and only if $g^m(a,h) > 0$ for all $(a,h) > (0,0)$, and
(I7) if and only if $g^M(a,h) < 1$ for all $a < 1$ and $h > 0$, where

(I7) $I(1, a) = 0$, for all $a \in [0,1)$.

Proof. Except for (I5) and (I6), all the properties are straightforward from (g1)-(g6).
From (g2), we obtain

$$I^L(1,a) = \sup_{0 \le h \le 1} \{h \mid g^M(a,h) \ge 1\}, \qquad I^L(a,0) = \sup_{0 \le h \le 1} \{h \mid g^m(a,h) \le 0\}.$$

From (g1) and (g6), we have (I5) and (I6). □

It should be noted that infinitely many different pairs $(g^m, g^M)$ produce the
same necessity measure as long as the level cut condition, or simply, the condition
$g^m(a,h) \le g^M(b,h)$, is equivalent.

Example 1. Let V(A) be the truth value (1 for true and 0 for false) of a statement
A. When $g^m(a,h) = \min(a, V(h > 0))$ and $g^M(a,h) = \max(a, V(a \ge h))$, $I^L$
is Gödel implication $I^G$. When $g^m(a,h) = \min(a, V(a > 1 - h))$ and $g^M(a,h) =
\max(a, V(h = 0))$, $I^L$ is reciprocal Gödel implication $I^{r-G}$.
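Example 1 can be checked by evaluating (15) on a grid. The sketch below is illustrative only: the grid and the helper names are assumptions, and for the reciprocal Gödel case it uses the threshold `a > 1 - h` in $g^m$, which is the form that is non-decreasing in h as (g4) requires and that reproduces $1 - a$ for $a > b$.

```python
from fractions import Fraction as F

GRID = [F(i, 20) for i in range(21)]

def V(statement):
    """Truth value: 1 for true, 0 for false."""
    return F(1) if statement else F(0)

def implication_from(g_m, g_M):
    """Build I^L of eq. (15) from a pair (g^m, g^M), with sup over the grid."""
    def I_L(a, b):
        return max(h for h in GRID if g_m(a, h) <= g_M(b, h))
    return I_L

godel_L = implication_from(
    lambda a, h: min(a, V(h > 0)),
    lambda a, h: max(a, V(a >= h)),
)

reciprocal_godel_L = implication_from(
    lambda a, h: min(a, V(a > 1 - h)),
    lambda a, h: max(a, V(h == 0)),
)

def godel(a, b):
    return F(1) if a <= b else b

def reciprocal_godel(a, b):
    return F(1) if a <= b else 1 - a

mismatches = [(a, b) for a in GRID for b in GRID
              if godel_L(a, b) != godel(a, b)
              or reciprocal_godel_L(a, b) != reciprocal_godel(a, b)]
print(len(mismatches))
```

An empty list of mismatches means both pairs generate the stated implication functions at every grid point.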


3  Level  Cut  Conditions  Derived  from  Necessity  Measures 


In this section, given a necessity measure, or equivalently, given an implication,
we discuss how we can obtain the functions $g^m$ and $g^M$. To do this, requirements
(g1)-(g6) are not sufficient to obtain some results. We add a requirement,

(g7) $g^m(\cdot, h)$ is lower semi-continuous and $g^M(\cdot, h)$ upper semi-continuous
for all $h \in [0,1]$,

and assume that $g^m$ and $g^M$ satisfy (g1)-(g7). Together with (g1), this additional
requirement guarantees that $I^L$ satisfies (I3) and

(I8) $I(a, \cdot)$ is upper semi-continuous.

First, we look into the properties of pseudo-inverses of $g^m(\cdot, h)$ and $g^M(\cdot, h)$
defined by

$$g^{m*}(a, h) = \sup_{0 \le b \le 1} \{b \mid g^m(b, h) \le a\}, \qquad
g^{M*}(a, h) = \inf_{0 \le b \le 1} \{b \mid g^M(b, h) \ge a\}. \tag{16}$$

We have the following propositions.


Proposition 2. $g^{m*}(\cdot, h)$ is upper semi-continuous and $g^{M*}(\cdot, h)$ lower
semi-continuous, for all $h \in [0,1]$.

Proof. When $h = 0$, it is obvious. When $h > 0$, from the definition and (g5), we
have

$$\{a \mid g^{m*}(a,h) \ge b^*\} = \bigcap_{0 \le b < b^*} \{a \mid a \ge g^m(b,h)\}.$$

This set is closed and hence, $g^{m*}(\cdot, h)$ is upper semi-continuous. The lower
semi-continuity of $g^{M*}(\cdot, h)$ can be proved similarly. □

Moreover, from (g1), we can prove the upper semi-continuity of $g^{m*}(a, \cdot)$ and
the lower semi-continuity of $g^{M*}(a, \cdot)$.

Proposition 3. We have

$$g^m(a, h) \le g^M(b, h) \iff a \le g^{m*}(g^M(b, h), h), \tag{17}$$

$$g^m(a, h) \le g^M(b, h) \iff g^{M*}(g^m(a, h), h) \le b. \tag{18}$$

Proof. From the definition of $g^{m*}$, we have $a \le g^{m*}(g^m(a,h), h)$ and $g^{m*}(\cdot, h)$
is non-decreasing. Thus, we have

$$g^m(a, h) \le g^M(b, h) \implies a \le g^{m*}(g^m(a, h), h) \le g^{m*}(g^M(b, h), h).$$

On the other hand, from the upper semi-continuity of $g^{m*}(\cdot, h)$, we can easily
prove $g^m(g^{m*}(a, h), h) \le a$. Thus, from the monotonicity (g5), we obtain

$$a \le g^{m*}(g^M(b,h), h) \implies g^m(a,h) \le g^m(g^{m*}(g^M(b,h), h), h) \le g^M(b,h).$$

Hence, (17) is valid. (18) can be proved in the same way. □

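Proposition 3 gives an expression of $I^L$ other than (15), i.e.,

$$I^L(a,b) = \sup_{0 \le h \le 1} \{h \mid a \le g^{m*}(g^M(b,h), h)\}, \quad \text{if } g^m(\cdot, h) \text{ is l.s.c.}, \tag{19}$$

$$I^L(a,b) = \sup_{0 \le h \le 1} \{h \mid g^{M*}(g^m(a,h), h) \le b\}, \quad \text{if } g^M(\cdot, h) \text{ is u.s.c.} \tag{20}$$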

Moreover, it should be noted that $I^{m*}(a,b) = g^{m*}(g^M(b,a), a)$ is an implication
function which satisfies (I1), (I3), (I4), (I5), (I6) and (I8) and $T^{M*}(a,b) =
g^{M*}(g^m(a,b), b)$ is a conjunction function which satisfies (T1), (T3), (T4), (T5),
(T6) and (T7):

(T6) $T(1, a) > 0$, for all $a \in (0,1]$,

(T7) $T(\cdot, a)$ is lower semi-continuous.

This fact together with (19) and (20) reminds us of functionals $\sigma$ and $\zeta$, both of which
yield an implication function, from an implication function I and a conjunction
function T, respectively (see [4][6][7]), i.e.,

$$\sigma[I](a,b) = \sup_{0 \le h \le 1} \{h \mid I(h,b) \ge a\}, \qquad \zeta[T](a,b) = \sup_{0 \le h \le 1} \{h \mid T(a,h) \le b\}. \tag{21}$$

Under the assumptions that I satisfies $I(1,0) = 0$, (I1), (I3), (I6) and (I4-a):
$I(a, c) \ge I(b, c)$ whenever $a \le b$, we can prove $\sigma[\sigma[I]] = I$ and $\sigma[I]$ preserves
those properties (see [4][7]). Under the assumptions that T satisfies $T(1,1) = 1$,
(T1), (T3), (T6) and (T4-b): $T(a,b) \le T(a,c)$ whenever $b \le c$, we have $\xi[\zeta[T]] =
T$ and $\zeta[T]$ satisfies $I(1,0) = 0$, (I1), (I5), (I8) and (I4-b): $I(a,b) \le I(a,c)$
whenever $b \le c$ (see [6][7]), where

$$\xi[I](a,b) = \inf_{0 \le h \le 1} \{h \mid I(a,h) \ge b\}. \tag{22}$$

Moreover, under the assumptions that I satisfies $I(1,0) = 0$, (I1), (I5), (I8) and
(I4-b), we have $\zeta[\xi[I]] = I$ and $\xi[I]$ satisfies $T(1,1) = 1$, (T1), (T3), (T6) and
(T4-b).
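The functionals in (21) and (22) are easy to evaluate on a grid. As a hedged illustration (the choice of the Łukasiewicz pair and all names below are assumptions made for this sketch, not taken from the text), take $T(a,b) = \max(0, a + b - 1)$ and $I(a,b) = \min(1, 1 - a + b)$ and restrict levels to multiples of 1/20, so that sup and inf become max and min.

```python
from fractions import Fraction as F

GRID = [F(i, 20) for i in range(21)]

def sigma(I):
    """sigma[I](a, b) = sup{h | I(h, b) >= a}, eq. (21), on the grid."""
    return lambda a, b: max(h for h in GRID if I(h, b) >= a)

def zeta(T):
    """zeta[T](a, b) = sup{h | T(a, h) <= b}, eq. (21), on the grid."""
    return lambda a, b: max(h for h in GRID if T(a, h) <= b)

def xi(I):
    """xi[I](a, b) = inf{h | I(a, h) >= b}, eq. (22), on the grid."""
    return lambda a, b: min(h for h in GRID if I(a, h) >= b)

def t_luk(a, b):
    # Lukasiewicz t-norm (assumed example of a conjunction function).
    return max(F(0), a + b - 1)

def i_luk(a, b):
    # Lukasiewicz implication (assumed example of an implication function).
    return min(F(1), 1 - a + b)

pairs = [(a, b) for a in GRID for b in GRID]
zeta_ok = all(zeta(t_luk)(a, b) == i_luk(a, b) for a, b in pairs)
xi_ok = all(xi(i_luk)(a, b) == t_luk(a, b) for a, b in pairs)
sigma_ok = all(sigma(sigma(i_luk))(a, b) == i_luk(a, b) for a, b in pairs)
print(zeta_ok, xi_ok, sigma_ok)
```

For this pair, $\zeta[T]$ recovers the implication, $\xi[I]$ recovers the conjunction, and $\sigma$ applied twice returns the original implication, in line with the identities quoted above.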

Under the assumptions (g1)-(g7), pairs $(I^L, I^{m*})$ and $(I^L, T^{M*})$ satisfy the
requirements for ($\sigma[I^L] = I^{m*}$ and $\sigma[I^{m*}] = I^L$) and ($\xi[I^L] = T^{M*}$ and $\zeta[T^{M*}] =
I^L$), respectively. Hence, given an arbitrary implication function I which
satisfies (I1), (I3)-(I6) and (I8), we obtain $I^{m*} = \sigma[I]$ and $T^{M*} = \xi[I]$. One may
think that, defining $g^m(a,h) = a$ or $g^M(a,h) = a$ when $h > 0$ so that we have
$g^{m*}(a, h) = a$ or $g^{M*}(a, h) = a$ for all $h > 0$, we can obtain $g^M$ and $g^m$ via (13)
with substitution of $I^{m*}$ or $T^{M*}$ for I and T. However, unfortunately, there is
no guarantee that such $g^m$ and $g^M$ satisfy (g3). The other properties, (g1), (g2),
(g4)-(g7), are satisfied as is shown in the following proposition.

Proposition 4. $\sigma$ preserves (I2), (I4) and (I8). Moreover, $\sigma$ preserves (I5)
under (I3). On the other hand, when I satisfies (I4-a), $\xi[I]$ satisfies (T4). When
I satisfies (I6) and (I8), $\xi[I]$ satisfies (T5). When I satisfies (I7), $\xi[I]$ satisfies
(T2). When I satisfies (I3), $\xi[I]$ satisfies (T7).

From  Propositions  1  and  4  together  with  (13),  the  following  theorem  is 
straightforward. 

Theorem 3. Let I satisfy (I1), (I3)-(I6) and (I8). When I satisfies (I2),

$$g^m(a,h) = a, \qquad g^M(a,h) = \sigma[I](h,a), \tag{23}$$

satisfy (g1)-(g7). On the other hand, when I satisfies (I7),

$$g^m(a,h) = \xi[I](a,h), \qquad g^M(a,h) = a, \tag{24}$$

satisfy (g1)-(g7).

Theorem 3 gives an answer to (Q2) when I satisfies (I2) or (I7) as well as
(I1), (I3)-(I6) and (I8). The complete answer to (Q2) is rather difficult since
decompositions of $I^{m*}$ and $T^{M*}$ to $g^m$ and $g^M$ satisfying (g1)-(g7) are not easy.
In what follows, we give other answers under certain conditions. The following
proposition plays a key role.

Proposition 5. When $g^m(\cdot, h)$ is continuous, we have $g^m(g^{m*}(a,h), h) = a$, for all
$h \in (0,1]$. Similarly, when $g^M(\cdot, h)$ is continuous, we have $g^M(g^{M*}(a,h), h) = a$.

Proof. Because of (g2), the continuity of $g^m(\cdot, h)$ and $g^M(\cdot, h)$ implies that
$g^m(\cdot, h)$ and $g^M(\cdot, h)$ are surjective for all $h \in (0,1]$, respectively. Hence, we
have $g^{m*}(a,h) = \sup_{0 \le b \le 1} \{b \mid g^m(b,h) = a\}$ and $g^{M*}(a,h) = \inf_{0 \le b \le 1} \{b \mid
g^M(b, h) = a\}$. Because of continuity, sup and inf can be replaced with max and
min, respectively. Hence, we have the proposition. □

Moreover,  the  following  proposition  is  straightforward. 
Proposition 6. If a given function $g^m : [0,1] \times [0,1] \to [0,1]$ satisfies (g1)-(g6),

(g8) $g^m(\cdot, h)$ is continuous for all $h \in (0,1]$,

(g9) $g^m(\sigma[I](a, 0), a) = 0$ for all $a \in (0,1]$,

(g10) $g^m(a, 1) < 1$ for all $a \in [0,1)$,

then $I^*(a,b) = g^m(\sigma[I](a,b), a)$ satisfies (I1)-(I3), (I5) and (I8). Similarly, if a
given function $g^M : [0,1] \times [0,1] \to [0,1]$ satisfies (g1)-(g6),

(g11) $g^M(\cdot, h)$ is continuous for all $h \in (0,1]$,

(g12) $g^M(\xi[I](1, a), a) = 1$ for all $a \in (0,1]$,

(g13) $g^M(a, 1) > 0$ for all $a \in (0,1]$,

then $T^*(a,b) = g^M(\xi[I](a,b), b)$ satisfies (T1)-(T3), (T5) and (T7).

From Propositions 5 and 6, if we find a function $g^m$ (resp. $g^M$) that satisfies (g1)-(g6)
and (g8)-(g10) (resp. (g11)-(g13)) and $I^*(a,b) = g^m(\sigma[I](a,b), a)$ (resp.
$T^*(a, b) = g^M(\xi[I](a, b), b)$) satisfies (I4) (resp. (T4)), then the pair $(g^m, \mathrm{Cnv}[I^*])$
(resp. $(T^*, g^M)$) is an answer to (Q2) with respect to a given implication function
I, where $\mathrm{Cnv}[I^*]$ is the converse of an implication function $I^*$, i.e., $\mathrm{Cnv}[I^*](a, b)
= I^*(b, a)$.

From Propositions 3 and 4, the necessary and sufficient condition of $N_A(B) \ge
h$ becomes (a) $\mu_A(x) \le \sigma[I](h, \mu_B(x))$, $\forall x \in X$, or (b) $\xi[I](\mu_A(x), h) \le \mu_B(x)$,
$\forall x \in X$. As can be seen easily, (a) holds if and only if $\max(\mu_A(x), \sigma[I](h, 0)) \le
\max(\sigma[I](h, \mu_B(x)), \sigma[I](h, 0))$, $\forall x \in X$, and (b) holds if and only if
$\min(\xi[I](\mu_A(x), h), \xi[I](1, h)) \le \min(\mu_B(x), \xi[I](1, h))$, $\forall x \in X$. From this fact,
giving bijective and strictly
increasing functions $p(\cdot, h) : [\sigma[I](h, 0), 1] \to [0,1]$ and $q(\cdot, h) : [0, \xi[I](1, h)] \to
[0,1]$, we may define $g^m(a, h) = p(\max(a, \sigma[I](h, 0)), h)$ for case (a) and $g^M(a, h)
= q(\min(a, \xi[I](1, h)), h)$ for case (b) under certain conditions. From this point
of view, we have the following theorem.

Theorem 4. We have the following assertions:

1. Let $p(\cdot, h) : [\sigma[I](h, 0), 1] \to [0,1]$ be a bijective and strictly increasing
function such that

(p1) $h_1 \ge h_2$ implies $p(a, h_1) \ge p(a, h_2)$ for all $a \in [\sigma[I](h_2, 0), 1]$,

(p2) $p(\max(\sigma[I](h, a), \sigma[I](h, 0)), h)$ is non-decreasing in h,

(p3) $\sigma[I](\cdot, 0)$ is continuous,

then $g^m(a,h) = p(\max(a, \sigma[I](h, 0)), h)$ and $g^M(a,h) = p(\max(\sigma[I](h, a),
\sigma[I](h, 0)), h)$ satisfy (g1)-(g7) and define the level cut condition.

2. Let $q(\cdot, h) : [0, \xi[I](1, h)] \to [0,1]$ be a bijective and strictly increasing
function such that

(q1) $h_1 \ge h_2$ implies $q(a, h_1) \le q(a, h_2)$ for all $a \in [0, \xi[I](1, h_2)]$,

(q2) $q(\min(\xi[I](a, h), \xi[I](1, h)), h)$ is non-increasing in h,

(q3) $\xi[I](1, \cdot)$ is continuous,

then $g^m(a,h) = q(\min(\xi[I](a, h), \xi[I](1, h)), h)$ and $g^M(a,h) = q(\min(a,
\xi[I](1, h)), h)$ satisfy (g1)-(g7) and define the level cut condition.

Proof. From Proposition 6, it is obvious. □
Table 1. $g^m$ and $g^M$ for $I^S$, $I^R$ and $I^{r-R}$

| $I$ | $f(0)$ | $g^m(a,h)$ ($h > 0$) | $g^M(a,h)$ ($h > 0$) |
| $I^S$ | - | $\max(0, 1 - f(a)/f(n(h)))$ | $\min(1, f(n(a))/f(n(h)))$ |
| $I^R$ | $< +\infty$ | $\max(0, 1 - f(a)/(f(0) - f(h)))$ | $\min(1, (f(0) - f(a))/(f(0) - f(h)))$ |
| $I^R$ | $= +\infty$ | $a$ | $f^{-1}(\max(0, f(a) - f(h)))$ |
| $I^{r-R}$ | $< +\infty$ | $\max(0, (f(n(a)) - f(h))/(f(0) - f(h)))$ | $\min(1, f(n(a))/(f(0) - f(h)))$ |
| $I^{r-R}$ | $= +\infty$ | $n(f^{-1}(\max(0, f(n(a)) - f(h))))$ | $a$ |

From Theorem 4, we can obtain the level cut condition of a given necessity
measure when we find suitable functions p and q. Table 1 shows the level
cut condition for S-, R- and reciprocal R-implication functions of a continuous
Archimedean t-norm t and strong negation n. A continuous Archimedean
t-norm is a conjunction function which is defined by $t(a,b) = f^*(f(a) + f(b))$
with a continuous and strictly decreasing function $f : [0,1] \to [0, +\infty)$ such
that $f(1) = 0$, where $f^* : [0, +\infty) \to [0,1]$ is a pseudo-inverse defined by
$f^*(r) = \sup\{h \mid f(h) \ge r\}$. A strong negation is a bijective strictly decreasing
function $n : [0,1] \to [0,1]$ such that $n(n(a)) = a$. Given a t-norm t and a strong
negation n, the associated S-implication function $I^S$, R-implication function $I^R$
and reciprocal R-implication $I^{r-R}$ are defined as follows (see [3]):

$$I^S(a,b) = n(t(a, n(b))), \qquad I^R(a,b) = \zeta[t](a,b), \qquad I^{r-R}(a,b) = \zeta[t](n(b), n(a)). \tag{26}$$
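As a sanity check of the S-implication entry of Table 1, read here as $g^m(a,h) = \max(0, 1 - f(a)/f(n(h)))$ and $g^M(a,h) = \min(1, f(n(a))/f(n(h)))$ for $h > 0$, one can plug in a concrete generator. The sketch below assumes the Łukasiewicz generator $f(a) = 1 - a$ and the standard negation $n(a) = 1 - a$; these choices, the grid, the convention $\sup \emptyset = 0$ in the pseudo-inverse and all names are assumptions made for illustration.

```python
from fractions import Fraction as F

GRID = [F(i, 20) for i in range(21)]

def f(a):
    # Assumed additive generator (Lukasiewicz): continuous, strictly decreasing, f(1) = 0.
    return 1 - a

def f_star(r):
    # Pseudo-inverse f*(r) = sup{h | f(h) >= r}; taken as 0 when the set is empty.
    return max(F(0), 1 - r)

def n(a):
    # Assumed strong negation.
    return 1 - a

def t(a, b):
    return f_star(f(a) + f(b))

def I_S(a, b):
    """S-implication of eq. (26)."""
    return n(t(a, n(b)))

def g_m(a, h):
    if h == 0:
        return F(0)                       # (g3)
    return max(F(0), 1 - f(a) / f(n(h)))

def g_M(a, h):
    if h == 0:
        return F(1)                       # (g3)
    return min(F(1), f(n(a)) / f(n(h)))

def I_L(a, b):
    """Eq. (15) with the sup taken over the grid of levels."""
    return max(h for h in GRID if g_m(a, h) <= g_M(b, h))

agree = all(I_L(a, b) == I_S(a, b) for a in GRID for b in GRID)
print(agree)
```

With this generator both sides reduce to $\min(1, 1 - a + b)$, so the level cut condition built from the table entry reproduces the S-implication on the grid.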
Abstract. Four methods of c-regression are compared. Two of them are
methods  of  fuzzy  clustering:  (a)  the  fuzzy  c-regression  methods,  and  (b) 
an  entropy  method  proposed  by  the  authors.  Two  others  are  probabilistic 
methods  of  (c)  the  deterministic  annealing,  and  (d)  the  mixture  distri¬ 
bution  method  using  the  EM  algorithm.  It  is  shown  that  the  entropy 
method  yields  the  same  formula  as  that  of  the  deterministic  annealing. 
Clustering  results  as  well  as  classification  functions  are  compared.  The 
classification  functions  for  fuzzy  clustering  are  fuzzy  rules  interpolat¬ 
ing  cluster  memberships,  while  those  for  the  latter  two  are  probabilistic 
rules.  Theoretical  properties  of  the  classification  functions  are  studied. 
A  numerical  example  is  shown. 


1  Introduction 

Recent  studies  on  fuzzy  clustering  revealed  that  there  are  new  methods  [5, 7-9] 
based  on  the  idea  of  regularization.  These  methods  are  comparable  with  the 
fuzzy  c-means  [1, 3]  and  their  variations.  The  c-regression  model  is  well-known 
among  the  variations,  namely,  Hathaway  and  Bezdek  have  developed  the  method 
of fuzzy c-regression [4]. It is not difficult to show, as we will see below, that the
new  methods  have  variations  that  are  applied  to  the  c-regression  model. 

Another  class  of  methods  that  may  compete  with  the  fuzzy  c-means  is  the 
mixture  distribution  model  [10]  with  the  EM  algorithm  [2]  for  the  calculation 
of  solutions.  This  method  is  based  on  the  statistical  model  and  hence  the  two 
frameworks  of  fuzziness  and  statistics  are  different.  Hathaway  and  Bezdek  [4] 
mention  a  simple  type  of  the  mixture  model  for  the  c-regression. 

Moreover  a  method  of  deterministic  annealing  has  been  proposed  [11]  that 
is  also  based  on  probability  theory.  This  method  uses  the  Gibbs  distribution 
for  determining  probabilistic  allocation  of  clusters  with  the  heuristic  method  of 
using  the  means  for  centers.  (See  also  Masulli  et  al.  [6].) 
In addition to the new methods of fuzzy c-means, we have introduced
classification functions that interpolate the membership values of the individuals to
each  cluster.  Global  characters  of  clusters  are  accordingly  made  clear  by  using  the 
classification  functions  [7-9].  The  concept  of  classification  functions  is  applica¬ 
ble  to  the  c-regression  model.  In  contrast,  classification  functions  in  probabilistic 
models  are  derived  from  probabilistic  rules  such  as  the  Bayes  rule. 

We  have  thus  four  methods  for  the  clustering  with  regression:  the  two  fuzzy 
models  and  the  other  two  probabilistic  models.  These  methods  should  now  be 
compared  in  theoretical,  methodological,  and  applicational  features.  In  doing 
this,  we  can  use  classification  functions. 

In the following we first review the four methods briefly, and develop the
algorithms  for  calculating  solutions  that  are  not  shown  in  foregoing  works.  The 
algorithm  for  calculating  solutions  for  the  entropy  method  developed  by  the 
authors  is  shown  to  be  equivalent  to  the  method  of  deterministic  annealing, 
although  the  two  models  are  different.  Theoretical  properties  of  the  classification 
functions  are  compared.  A  numerical  example  is  shown  to  see  differences  in 
clustering  results  and  classification  functions. 


2  Fuzzy  methods  and  probabilistic  methods 

2.1  Two  methods  of  fuzzy  c-regression 

A  family  of  methods  on  the  basis  of  fuzzy  c-means  have  been  developed;  there  are 
common  features  in  the  methods.  First,  an  alternative  optimization  algorithm  is 
used  to  find  solutions.  Second,  objective  functions  for  clustering  have  a  common 
form. 

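Let $x_k = (x_k^1, \ldots, x_k^p)^T$, $k = 1, \ldots, n$ be individuals to be clustered. They
are points in p-dimensional Euclidean space. We consider two types of objective
functions:

$$J_1 = \sum_{i=1}^{c} \sum_{k=1}^{n} (u_{ik})^m D_{ik}, \qquad
J_2 = \sum_{i=1}^{c} \sum_{k=1}^{n} u_{ik} D_{ik} + \lambda^{-1} \sum_{i=1}^{c} \sum_{k=1}^{n} u_{ik} \log u_{ik},$$

where $u_{ik}$ is the element of cluster membership matrix $U = (u_{ik})$. The constraint
of the fuzzy partition $M = \{(u_{ik}) \mid \sum_{i=1}^{c} u_{ik} = 1,\ 0 \le u_{ik} \le 1,\ i = 1, \ldots, c,\ k =
1, \ldots, n\}$ is assumed as usual.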

The term $D_{ik}$ varies in accordance with the types of clustering problems. In
the standard fuzzy c-means, $J_1$ is used with $D_{ik} = \|x_k - v_i\|^2$, the square of
the Euclidean distance between the individual $x_k$ and the center $v_i$ of the cluster
i, while $J_2$ is used with the same $D_{ik}$ in the method of entropy proposed by the
authors [7].

Since we consider c-regression models, $D_{ik} = (y_k - f_i(x_k; \beta_i))^2$ is used instead
of $\|x_k - v_i\|^2$. Remark that the set of data has the form of $(x_k, y_k)$, $k = 1, \ldots, n$,
where x is the p-dimensional independent variable, while y is a scalar dependent
variable. We wish to find a function $f_i(x; \beta_i)$ of regression by choosing parameters
$\beta_i$ so that the objective functions are minimized. Among possible choices for $f_i$,
the linear regression is assumed:

$$y = f_i(x; \beta_i) = \sum_{j=1}^{p} \beta_i^j x^j + \beta_i^{p+1} \tag{1}$$

whence $\beta_i = (\beta_i^1, \ldots, \beta_i^{p+1})$ is a p + 1 dimensional vector parameter.

The term $D_{ik}$ is thus a function of $\beta_i$: $D_{ik}(\beta_i) = |y_k - \sum_{j=1}^{p} \beta_i^j x_k^j - \beta_i^{p+1}|^2$.
Letting $B = (\beta_1, \ldots, \beta_c)$, we can express the objective functions as functions of
U and B:

$$J_1(U, B) = \sum_{i=1}^{c} \sum_{k=1}^{n} (u_{ik})^m D_{ik}(\beta_i) \tag{2}$$

$$J_2(U, B) = \sum_{i=1}^{c} \sum_{k=1}^{n} u_{ik} D_{ik}(\beta_i) + \lambda^{-1} \sum_{i=1}^{c} \sum_{k=1}^{n} u_{ik} \log u_{ik} \tag{3}$$

Finally,  we  note  that  the  following  alternative  optimization  algorithm  is  used 
for  finding  optimal  U  and  B,  in  which  either  J  =  Ji  or  J  =  J2. 


Algorithm of fuzzy c-regression

R1 Set initial value for $\bar{B}$.

R2 Find optimal solution $\bar{U}$: $J(\bar{U}, \bar{B}) = \min_{U \in M} J(U, \bar{B})$.

R3 Find optimal solution $\bar{B}$: $J(\bar{U}, \bar{B}) = \min_{B \in R^{(p+1)c}} J(\bar{U}, B)$.

R4 Check stopping criterion and if convergent, stop. Otherwise go to R2.

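Assume $J = J_1$ (the standard fuzzy c-regression method is used). It is then
easy to see that the solution in R2 is

$$u_{ik} = \left[ \sum_{j=1}^{c} \left( \frac{D_{ik}}{D_{jk}} \right)^{\frac{1}{m-1}} \right]^{-1} \tag{4}$$

while the solution $\beta_i$ in R3 is obtained by solving

$$\left( \sum_{k=1}^{n} (u_{ik})^m z_k z_k^T \right) \beta_i = \sum_{k=1}^{n} (u_{ik})^m y_k z_k \tag{5}$$

with respect to $\beta_i$, where $z_k = (x_k^1, \ldots, x_k^p, 1)^T$.

If we use $J = J_2$, we have

$$u_{ik} = \frac{e^{-\lambda D_{ik}}}{\sum_{j=1}^{c} e^{-\lambda D_{jk}}}, \qquad
\beta_i = \left( \sum_{k=1}^{n} u_{ik} z_k z_k^T \right)^{-1} \sum_{k=1}^{n} u_{ik} y_k z_k \tag{6}$$

in R2 and R3, respectively.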
(6) 
2.2 Mixture distribution model for c-regression


The  model  of  the  mixture  of  normal  distribution  is  another  useful  method  for 
clustering  with  the  EM  algorithm  [2, 10].  It  is  based  on  the  statistical  concept  of 
maximum  likelihood  but  the  results  are  comparable  with  those  by  fuzzy  c-means. 
Application  of  this  model  to  the  c-regression  has  been  mentioned  by  Hathaway 
and  Bezdek  [4]. 

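We simply describe an outline of the algorithm. Notice first that the model
assumes that the distribution of the error term $e_i = y - \sum_{j=1}^{p} \beta_i^j x^j - \beta_i^{p+1}$ is
Gaussian with the mean 0 and the standard deviation $\sigma_i$ which is to be estimated
in the algorithm.

The distribution is hence assumed to be

$$p(x, y) = \sum_{i=1}^{c} \alpha_i p_i(x, y \mid C_i), \qquad \left( \sum_{i=1}^{c} \alpha_i = 1 \right) \tag{7}$$

$$p_i(x, y \mid C_i) = \frac{1}{\sqrt{2\pi}\, \sigma_i} \exp\left( -\frac{1}{2\sigma_i^2} \Bigl( y - \sum_{j=1}^{p} \beta_i^j x^j - \beta_i^{p+1} \Bigr)^2 \right) \tag{8}$$

The parameters $\phi_i = (\alpha_i, \sigma_i, \beta_i)$ $(i = 1, \ldots, c)$ should be estimated by the EM
algorithm. For simplicity, let $\Phi = (\phi_1, \ldots, \phi_c)$ and assume that $p(x, y \mid \Phi)$ is the
density with the parameter $\Phi$.

Let an estimate of the parameters be $\phi_i'$ $(i = 1, \ldots, c)$, then the next estimate
$\alpha_i$ by the EM algorithm is given as follows.

$$\alpha_i = \frac{1}{n} \sum_{k=1}^{n} \frac{\alpha_i' p_i(x_k, y_k \mid \phi_i')}{p(x_k, y_k \mid \Phi')} = \frac{\Psi_i}{n},$$

where

$$\psi_{ik} = \frac{\alpha_i' p_i(x_k, y_k \mid \phi_i')}{p(x_k, y_k \mid \Phi')}, \qquad \Psi_i = \sum_{k=1}^{n} \psi_{ik}.$$

$\beta_i$ is obtained from solving the following equation:

$$\sum_{j=1}^{p} \Bigl( \sum_{k=1}^{n} \psi_{ik} x_k^j x_k^{\ell} \Bigr) \beta_i^j + \Bigl( \sum_{k=1}^{n} \psi_{ik} x_k^{\ell} \Bigr) \beta_i^{p+1} = \sum_{k=1}^{n} \psi_{ik} y_k x_k^{\ell} \quad (\ell = 1, \ldots, p)$$

$$\sum_{j=1}^{p} \Bigl( \sum_{k=1}^{n} \psi_{ik} x_k^j \Bigr) \beta_i^j + \Bigl( \sum_{k=1}^{n} \psi_{ik} \Bigr) \beta_i^{p+1} = \sum_{k=1}^{n} \psi_{ik} y_k$$

and finally we have

$$\sigma_i^2 = \frac{1}{\Psi_i} \sum_{k=1}^{n} \psi_{ik} \Bigl( y_k - \sum_{j=1}^{p} \beta_i^j x_k^j - \beta_i^{p+1} \Bigr)^2.$$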
Remark moreover that the individual $(x_k, y_k)$ is allocated to the cluster $C_i$
using the Bayes formula. Namely, the probability of the allocation is given by

$$p(C_i \mid x_k, y_k) = \frac{\alpha_i p_i(x_k, y_k \mid \phi_i)}{\sum_{j=1}^{c} \alpha_j p_j(x_k, y_k \mid \phi_j)} \tag{9}$$


2.3  Deterministic  annealing 

Gibbs distribution is used for the probabilistic rule of allocating individuals to
clusters [11]. Namely, the following rule is used:

$$Pr(x \in C_i) = \frac{e^{-\rho \|x - y_i\|^2}}{\sum_{j=1}^{c} e^{-\rho \|x - y_j\|^2}} \tag{10}$$

in which $\rho$ is a parameter and $y_i$ is the cluster representative. For given $y_i$
$(i = 1, \ldots, c)$, $Pr(x \in C_i)$ is determined as above, then the cluster representative
is calculated again by the average:

$$y_i = \frac{\sum_{x} x \cdot Pr(x \in C_i)}{\sum_{x} Pr(x \in C_i)} \tag{11}$$

Iterations of (10) and (11) until convergence provide clusters by this method.
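The iteration of (10) and (11) at a fixed parameter can be sketched as follows. This is only an illustration: the data, the initial representatives, the value of the parameter (called `rho` here) and the function name are assumptions, and no schedule for changing the parameter during the run is included.

```python
import math

def da_step(points, reps, rho):
    """One pass of rule (10) followed by the averaging step (11)."""
    probs = []
    for x in points:
        d = [sum((a - b) ** 2 for a, b in zip(x, rep)) for rep in reps]
        dmin = min(d)  # shifting by the minimum leaves the ratio in (10) unchanged
        w = [math.exp(-rho * (di - dmin)) for di in d]
        total = sum(w)
        probs.append([wi / total for wi in w])
    new_reps = []
    for i in range(len(reps)):
        mass = sum(p[i] for p in probs)
        new_reps.append([sum(p[i] * x[j] for p, x in zip(probs, points)) / mass
                         for j in range(len(points[0]))])
    return probs, new_reps

# Hypothetical two-dimensional data forming two well-separated groups.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
reps = [[1.0, 1.0], [9.0, 9.0]]

for _ in range(20):
    probs, reps = da_step(points, reps, rho=1.0)

print(reps)
```

For such clearly separated groups the allocation probabilities are practically crisp and the representatives settle at the group means.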


2.4  Deterministic  annealing  and  entropy  method 


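It is now straightforward to use the deterministic annealing for the c-regression.
We have

$$\pi_{ik} = Pr((x_k, y_k) \in C_i) = \frac{e^{-\rho D_{ik}(\beta_i)}}{\sum_{j=1}^{c} e^{-\rho D_{jk}(\beta_j)}} \tag{12}$$

while

$$\beta_i = \left( \sum_{k=1}^{n} \pi_{ik} z_k z_k^T \right)^{-1} \sum_{k=1}^{n} \pi_{ik} y_k z_k \tag{13}$$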

(Remark:  Detailed  proof  is  omitted  here  to  save  the  space.) 

Now, readers can see that the method of entropy and the deterministic annealing
provide the equivalent solutions by putting $\rho = \lambda$, although the models
are  different;  the  entropy  method  is  a  fuzzy  model  and  is  based  on  the  alternative 
optimization,  while  the  deterministic  annealing  is  a  probabilistic  model  and  an 
objective  function  to  be  optimized  is  not  assumed.  Although  we  have  shown  this 
equivalence  in  the  case  of  c-regression,  the  same  argument  is  applicable  to  the 
c-means  and  to  other  variations  of  the  c-means. 
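
A minimal Python sketch of alternating (12) and (13) for a single explanatory variable ($p = 1$) follows. It assumes that $D_{ik}$ is the squared residual of point $k$ from line $i$ and that $z_k = (x_k, 1)$; the function names, data and starting lines are hypothetical. Reading `rho` as $\lambda$ gives the same computation for the entropy method.

```python
import math

def memberships(points, lines, rho):
    """Equation (12), with D_ik assumed to be the squared residual."""
    result = []
    for x, y in points:
        exps = [-rho * (y - (slope * x + icpt)) ** 2 for slope, icpt in lines]
        top = max(exps)  # shift exponents for numerical stability
        weights = [math.exp(e - top) for e in exps]
        total = sum(weights)
        result.append([w / total for w in weights])
    return result

def weighted_line(points, weights):
    """Equation (13) for p = 1: weighted least squares with z_k = (x_k, 1)."""
    sw = sum(weights)
    sx = sum(w * x for w, (x, _) in zip(weights, points))
    sy = sum(w * y for w, (_, y) in zip(weights, points))
    sxx = sum(w * x * x for w, (x, _) in zip(weights, points))
    sxy = sum(w * x * y for w, (x, y) in zip(weights, points))
    det = sxx * sw - sx * sx
    slope = (sxy * sw - sx * sy) / det
    intercept = (sxx * sy - sx * sxy) / det
    return slope, intercept

def c_regression(points, lines, rho, iterations=100):
    """Alternate the membership update (12) and the line update (13)."""
    for _ in range(iterations):
        pi = memberships(points, lines, rho)
        lines = [weighted_line(points, [row[i] for row in pi])
                 for i in range(len(lines))]
    return lines

# Hypothetical noiseless data on the lines y = x and y = -x + 4.
xs = (0.0, 1.0, 3.0, 4.0)
points = [(x, x) for x in xs] + [(x, -x + 4.0) for x in xs]
lines = c_regression(points, [(0.8, 0.3), (-0.8, 3.7)], rho=5.0)
```

Started near the true lines, the sketch recovers slopes and intercepts close to $(1, 0)$ and $(-1, 4)$.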
3 Classification Functions


Classification functions in fuzzy c-means and the entropy method have been
proposed and studied in Miyamoto and Mukaidono [7]. This idea can be applied
to the c-regression.

The classification function in fuzzy clustering means that a new generic
observation should be allocated to each cluster with the membership defined by
that function, hence the function should have x and y as independent variables
in this case of c-regression.

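In analogy to the fuzzy c-means, the classification function in fuzzy c-regression
is defined by replacing $x_k, y_k$ by the corresponding variables $x = (x^1, \ldots, x^p)$
and $y$:

$$U_i^1(x, y) = \Biggl[ \sum_{\ell=1}^{c} \Biggl( \frac{\bigl| y - \sum_{j=1}^{p} \beta_i^j x^j - \beta_i^{p+1} \bigr|^2}{\bigl| y - \sum_{j=1}^{p} \beta_\ell^j x^j - \beta_\ell^{p+1} \bigr|^2} \Biggr)^{\frac{1}{m-1}} \Biggr]^{-1} \qquad (14)$$

and when

$$y = \sum_{j=1}^{p} \beta_i^j x^j + \beta_i^{p+1} \qquad (15)$$

for a particular $i$, the corresponding $U_i^1(x, y) = 1$ and $U_\ell^1(x, y) = 0$ for $\ell \neq i$.
In the case of the entropy method, we have

$$U_i^2(x, y) = \frac{\exp\bigl( -\lambda \bigl| y - \sum_{j=1}^{p} \beta_i^j x^j - \beta_i^{p+1} \bigr|^2 \bigr)}{\sum_{\ell=1}^{c} \exp\bigl( -\lambda \bigl| y - \sum_{j=1}^{p} \beta_\ell^j x^j - \beta_\ell^{p+1} \bigr|^2 \bigr)} \qquad (16)$$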

Remark that the values of the parameters $\beta_i$ ($i = 1, \ldots, c$) are obtained
when the corresponding algorithm of clustering terminates.

As shown above, the classification function for the deterministic annealing is
equivalent to that by the entropy method. Thus, $Pr((x, y) \in C_i) = U_i^2(x, y)$ by
putting $\lambda = \rho$.

For the mixture distribution model, the same idea is applied. Namely, we
can define a classification function, or a discrimination function, by replacing the
symbols $x_k$, $y_k$ by the variables $x$ and $y$. We thus have

$$U_i^3(x, y) = p(C_i \mid x, y) = \frac{\alpha_i \, p_i(x, y \mid \phi_i)}{\sum_{\ell=1}^{c} \alpha_\ell \, p_\ell(x, y \mid \phi_\ell)} \qquad (17)$$

in which the parameter $\Phi$ is obtained from the application of the EM algorithm.
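
The contrast between the fuzzy c-regression and entropy classification functions can be tried numerically. The sketch below is an illustration only. It assumes the usual fuzzy c-means form with fuzzifier `m` for the first function and the exponential form with parameter `lam` for the second, both applied to squared residuals, for $p = 1$. The two regression lines and all function names are hypothetical.

```python
import math

def residual_sq(x, y, line):
    """Squared vertical distance of (x, y) from the line (slope, intercept)."""
    slope, intercept = line
    return (y - slope * x - intercept) ** 2

def u_fuzzy(i, x, y, lines, m=2.0):
    """Fuzzy c-regression classification function (assumed standard form)."""
    d = [residual_sq(x, y, ln) for ln in lines]
    if d[i] == 0.0:
        return 1.0  # the point lies exactly on regression line i
    if any(v == 0.0 for v in d):
        return 0.0  # the point lies exactly on another regression line
    return 1.0 / sum((d[i] / v) ** (1.0 / (m - 1.0)) for v in d)

def u_entropy(i, x, y, lines, lam=1.0):
    """Entropy-method classification function (a softmax of -lam * residual^2)."""
    exps = [-lam * residual_sq(x, y, ln) for ln in lines]
    top = max(exps)  # shift exponents for numerical stability
    weights = [math.exp(e - top) for e in exps]
    return weights[i] / sum(weights)

lines = [(1.0, 0.0), (-1.0, 4.0)]          # hypothetical regression lines
on_line = u_fuzzy(0, 1.0, 1.0, lines)      # (1, 1) lies on line 0
soft = u_entropy(0, 1.0, 1.0, lines)       # same point, entropy method
far = u_fuzzy(0, 1.0, 1e6, lines)          # very far from both lines
```

At a point on line 0 the fuzzy function equals 1 while the entropy function stays strictly below 1, and far away from both lines the fuzzy function approaches $1/c$.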
Comparison of classification functions

Some of the theoretical properties of the classification functions are easily proved.
First, notice that the maximum value $U_i^1(x, y) = 1$ is attained at the points
where (15) is satisfied. On the other hand, when

$$\Bigl| y - \sum_{j=1}^{p} \beta_\ell^j x^j - \beta_\ell^{p+1} \Bigr| \to +\infty$$

for all $1 \le \ell \le c$, we have $U_i^1(x, y) \to 1/c$.

We next examine the classification function of the entropy method (and
equivalently, we are examining the function for the deterministic annealing).
It should be remarked that the maximum value of $U_i^2(x, y)$ is not necessarily at
the point satisfying (15).

It is easily seen that the value $U_i^2(x, y) = 1$ cannot be attained for a particular
$(x, y)$ in contrast to $U_i^1$. Instead, we have $\lim_{y \to +\infty} U_i^2(\tilde{x}, y) = 1$ for some $i$ and
an appropriately chosen $x = \tilde{x}$. Whether this property holds or not depends
upon the relative positions of the vectors $(\beta_i^1, \ldots, \beta_i^p) \in R^p$, $i = 1, \ldots, c$. Notice
that $\beta_i^{p+1}$ is not included in the discussion below. Hence another vector
$\hat{\beta}_i = (\beta_i^1, \ldots, \beta_i^p)$ is used instead of $\beta_i$.

For the following two propositions, the proofs are not difficult and are omitted
here. The first proposition formally states the above result.

Proposition 1. If there exists an open half space $S \subset R^p$ such that

$$\{ \hat{\beta}_\ell - \hat{\beta}_i : 1 \le \ell \le c, \ \ell \neq i \} \subset S \qquad (18)$$

then there exists $\tilde{x} \in R^p$ such that $\lim_{y \to \infty} U_i^2(\tilde{x}, y) = 1$. If such a half space does
not exist, then for all $x \in R^p$, $\lim_{y \to \infty} U_i^2(x, y) = 0$.

A condition for the existence of $S$ in Proposition 1 is the following.

Proposition 2. Let $T = \mathrm{span}\{\hat{\beta}_1, \ldots, \hat{\beta}_c\}$ and $CO\{\hat{\beta}_1, \ldots, \hat{\beta}_c\}$ be the convex
hull in $T$ generated by $\{\hat{\beta}_1, \ldots, \hat{\beta}_c\}$. Assume that $\{\hat{\beta}_1, \ldots, \hat{\beta}_c\}$ is independent.
Then a condition for the existence of $S$ for a particular $i$ such that (18) holds is
that

$$\hat{\beta}_i \notin \mathrm{int}(CO\{\hat{\beta}_1, \ldots, \hat{\beta}_c\}).$$

In other words, such an $S$ exists if and only if the vertex of $\hat{\beta}_i$ is not in the interior
of the above convex hull.
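
Proposition 1 can be checked numerically in the simplest case $p = 1$, where an open half space is an open half line. The sketch below is a hypothetical illustration, not part of the original argument: three lines through the origin with slopes $-1$, $0$ and $1$, and the entropy classification function assumed to be a softmax of $-\lambda$ times the squared residual.

```python
import math

def u_entropy(i, x, y, slopes, intercepts, lam=1.0):
    """Entropy-method classification function for p = 1 (assumed form)."""
    exps = [-lam * (y - b * x - a) ** 2 for b, a in zip(slopes, intercepts)]
    top = max(exps)  # shift exponents so the largest term is exp(0)
    weights = [math.exp(e - top) for e in exps]
    return weights[i] / sum(weights)

slopes = [-1.0, 0.0, 1.0]       # the vectors beta-hat_i when p = 1
intercepts = [0.0, 0.0, 0.0]

# Index 2 (slope 1): the differences {-2, -1} lie in the open half line (-inf, 0),
# so a suitable x (here x = 1) drives the value towards 1 as y grows.
extreme = u_entropy(2, 1.0, 1e3, slopes, intercepts)

# Index 1 (slope 0): the differences {-1, +1} fit in no open half line,
# so the value tends to 0 whatever the sign of x.
interior_pos = u_entropy(1, 1.0, 1e3, slopes, intercepts)
interior_neg = u_entropy(1, -1.0, 1e3, slopes, intercepts)
```

The middle slope is the one lying in the interior of the interval spanned by the three slopes, which is the one-dimensional picture behind the convex hull condition.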

We thus observe that the maximum value is not at the points of (15) in
the entropy method. Analogous results hold for the classification function $U_i^3$
of the mixture distribution model, but propositions of the above type cannot
be derived since the function is too complicated. Nevertheless, it is easily seen
that the form of the classification function $U_i^3$ becomes equivalent to $U_i^2$ in a
particular case of $\sigma_1 = \cdots = \sigma_c$ and $\alpha_1 = \cdots = \alpha_c$. We therefore expect that
the maximum value is not at the points of (15) in the mixture model also.
4 A Numerical Example

Figure 1 shows an artificial example of a set of points with two regression lines.
The lines in the figure have been obtained from the fuzzy c-regression, but no
remarkable differences have been observed concerning the regression lines derived
from the four methods.

Figure 2 shows the three-dimensional plot of the classification function for one
cluster by the fuzzy c-regression, whereas Figure 3 depicts the plot of $p(C_i \mid x, y)$
by the mixture distribution model. Readers can see a remarkable difference
between these two classification functions. The classification function by the
entropy method (and the deterministic annealing) in Figure 4 is similar to that in
Figure 3.


Fig. 3. Classification function by mixture distributions


Fig. 4. Classification function by entropy method
5 Conclusion

Four methods of c-regression have been considered and classification functions
have been studied. It should be remarked that the classification function in
the standard fuzzy c-regression reveals the shape of the regression hyperplane by
its maximum values, whereas the entropy method, the deterministic annealing,
and the mixture model do not express those shapes of the regressions. Hence,
observing outlines and global characteristics of the clusters by the latter class of
methods requires other types of functions, which we will study in future work.

The importance of the entropy method is that it stands between the fuzzy
c-means and the mixture model. Moreover, the deterministic annealing is equivalent
to the entropy method. It thus is based on fuzzy sets and at the same time a
probabilistic interpretation is possible.

Future studies include theoretical investigations of the classification functions
in the mixture of normal distributions and discussion of other variations of fuzzy
c-means using the classification functions.

This study has partly been supported by TARA (Tsukuba Advanced Research
Alliance), University of Tsukuba.
Abstract. Conventional studies on rule discovery and rough set methods
mainly focus on acquisition of rules, the targets of which have mutually
exclusive supporting sets. However, mutual exclusiveness does not
always hold in real-world databases, where conventional probabilistic
approaches cannot be applied. In this paper, first, we show that these
phenomena are easily found in data mining contexts: when we apply
attribute-oriented generalization to attributes in databases, generalized
attributes will have fuzziness for classification. Secondly, we show that
real-world databases may have fuzzy contexts. Then, finally, these
contexts should be analyzed by using fuzzy techniques, where context-free
fuzzy sets will be a key idea.


1  Introduction 

Conventional studies on machine learning [10], rule discovery [2] and rough set
methods [5, 12, 13] mainly focus on acquisition of rules, the targets of which
have mutually exclusive supporting sets. Supporting sets of target concepts form
a partition of the universe, and each method searches for sets which cover this
partition. In particular, Pawlak’s rough set theory shows that a family of sets can
form an approximation of the partition of the universe. These ideas can easily be
extended into probabilistic contexts, such as shown in Ziarko’s variable precision
rough set model [15]. However, mutual exclusiveness of the target does not always
hold in real-world databases, where conventional probabilistic approaches cannot be
applied.

In this paper, first, we show that these phenomena are easily found in data
mining contexts: when we apply attribute-oriented generalization to attributes
in databases, generalized attributes will have fuzziness for classification. In this
case, we have to take care of the conflicts between attributes, which
can be viewed as a problem with multiple membership functions. Secondly, we
will see that real-world databases may have fuzzy contexts. Usually, some kinds of
experts use multi-valued attributes, corresponding to a list. In particular, in a medical
context, people may have several diseases during the same period. These cases
also violate the assumption of mutual exclusiveness. Then, finally, these contexts
should be analyzed by using fuzzy techniques, where context-free fuzzy sets will
be a key idea to solve this problem.

2  Attribute-Oriented  Generalization  and  Fuzziness 

In this section, first, a probabilistic rule is defined by using two probabilistic
measures. Then, attribute-oriented generalization is introduced as transforming
rules.


2.1  Probabilistic  Rules 

Accuracy and Coverage. In the subsequent sections, we adopt the following
notations, which are introduced in [9].

Let $U$ denote a nonempty, finite set called the universe and $A$ denote a
nonempty, finite set of attributes, i.e., $a : U \to V_a$ for $a \in A$, where $V_a$ is called
the domain of $a$, respectively. Then, a decision table is defined as an information
system, $\mathcal{A} = (U, A \cup \{d\})$.

The atomic formulas over $B \subseteq A \cup \{d\}$ and $V$ are expressions of the form
$[a = v]$, called descriptors over $B$, where $a \in B$ and $v \in V_a$. The set $F(B, V)$ of
formulas over $B$ is the least set containing all atomic formulas over $B$ and closed
with respect to disjunction, conjunction and negation.

For each $f \in F(B, V)$, $f_A$ denotes the meaning of $f$ in $\mathcal{A}$, i.e., the set of all
objects in $U$ with property $f$, defined inductively as follows.

1. If $f$ is of the form $[a = v]$ then $f_A = \{ s \in U \mid a(s) = v \}$

2. $(f \wedge g)_A = f_A \cap g_A$; $(f \vee g)_A = f_A \cup g_A$; $(\neg f)_A = U - f_A$

By the use of this framework, classification accuracy and coverage, or true
positive rate, are defined as follows.

Definition 1.

Let $R$ and $D$ denote a formula in $F(B, V)$ and a set of objects which belong to
a decision $d$. Classification accuracy and coverage (true positive rate) for $R \to d$
are defined as:

$$\alpha_R(D) = \frac{|R_A \cap D|}{|R_A|} \ (= P(D \mid R)), \quad \text{and} \quad \kappa_R(D) = \frac{|R_A \cap D|}{|D|} \ (= P(R \mid D)),$$

where $|A|$ denotes the cardinality of a set $A$, $\alpha_R(D)$ denotes a classification
accuracy of $R$ as to classification of $D$, and $\kappa_R(D)$ denotes a coverage, or a true
positive rate of $R$ to $D$, respectively.¹


¹ Pawlak recently reported a Bayesian relation between accuracy and coverage [8]:
$\kappa_R(D) P(D) = P(R \mid D) P(D) = P(R, D)$
Definition of Rules

By the use of accuracy and coverage, a probabilistic rule is defined as:

$$R \xrightarrow{\alpha, \kappa} d \quad \text{s.t.} \quad R = \wedge_j \vee_k [a_j = v_k], \quad \alpha_R(D) \ge \delta_\alpha, \quad \kappa_R(D) \ge \delta_\kappa.$$

This rule is a kind of probabilistic proposition with two statistical measures,
which is an extension of Ziarko’s variable precision model (VPRS) [15].²

It is also notable that both a positive rule and a negative rule are defined as
special cases of this rule, as shown in the next subsections.


2.2  Attribute-Oriented  Generalization 

Rule induction methods regard a database as a decision table [5] and induce rules,
which can be viewed as reduced decision tables. However, those rules extracted
from tables do not include information about attributes and they are too simple.
In practical situations, domain knowledge of attributes is very important to gain
the comprehensibility of induced knowledge, which is one of the reasons why
databases are implemented as relational databases [1]. Thus, reinterpretation of
induced rules by using information about attributes is needed to acquire
comprehensible rules. For example, telorism, cornea, antimongoloid slanting of palpebral
fissures, iris defects and long eyelashes are symptoms around eyes. Thus, those
symptoms can be gathered into a category “eye symptoms” when the location
of symptoms should be focused on. The relations among those attributes are
hierarchical as shown in Figure 1. This process, grouping of attributes, is called
attribute-oriented generalization [1].

Attribute-oriented generalization can be viewed as transformation of variables
in the context of rule induction. For example, an attribute “iris defects”
should be transformed into an attribute “eye symptoms = yes”. It is notable that
the transformation of attributes in rules corresponds to that of a database because
a set of rules is equivalent to a reduced decision table. In this case, the case when
eyes are normal is defined as “eye symptoms = no”. Thus, the transformation rule
for iris defects is defined as:

$$[\text{iris-defects} = \text{yes}] \rightarrow [\text{eye-symptoms} = \text{yes}] \qquad (1)$$


In general, when $[A_k = V_l]$ is an upper-level concept of $[a_i = v_j]$, a transforming
rule is defined as:

$$[a_i = v_j] \rightarrow [A_k = V_l],$$

and the supporting set of $[A_k = V_l]$ is:

$$[A_k = V_l]_A = \bigcup_{i,j} [a_i = v_j]_a$$

$= P(R) P(D \mid R) = \alpha_R(D) P(R)$

This relation also suggests that a priori and a posteriori probabilities should be easily
and automatically calculated from a database.

² This probabilistic rule is also a kind of Rough Modus Ponens [7].
Face
  Eye
    telorism: hyper, normal, hypo
    cornea: megalo, large, normal
    antimongoloid slanting of palpebral fissures: yes, no
    iris defects: yes, no
    eyelashes: long, normal
  Noses


Fig. 1. An Example of Attribute Hierarchy


where $A$ and $a$ are sets of attributes for upper-level and lower-level concepts,
respectively.


2.3  Examples 

Let us illustrate how fuzzy contexts are observed when attribute-oriented
generalization is applied by using a small table (Table 1).


Table 1. A Small Database on Congenital Disorders

U  round  telorism  cornea  slanting  iris-defects  eyelashes  class
1  no     normal    megalo  yes       yes           long       Aarskog
2  yes    hyper     megalo  yes       yes           long       Aarskog
3  yes    hypo      normal  no        no            normal     Down
4  yes    hyper     normal  no        no            normal     Down
5  yes    hyper     large   yes       yes           long       Aarskog
6  no     hyper     megalo  yes       no            long       Cat-cry

Definitions: round: round face, slanting: antimongoloid slanting of
palpebral fissures, Aarskog: Aarskog Syndrome, Down: Down Syndrome,
Cat-cry: Cat Cry Syndrome.


Then, it is easy to see that a rule of “Aarskog”,

$$[\text{iris-defects} = \text{yes}] \rightarrow \text{Aarskog} \qquad \alpha = 1.0, \ \kappa = 1.0$$

is obtained from Table 1.
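
The accuracy and coverage of such a rule can be recomputed directly from Table 1. The following Python sketch is only an illustration of Definition 1: the tuple layout, the column names and the function name are hypothetical, and the function assumes that at least one row matches the condition and at least one row belongs to the decision class.

```python
from fractions import Fraction

# Rows of Table 1: round, telorism, cornea, slanting, iris-defects, eyelashes, class
TABLE_1 = [
    ("no", "normal", "megalo", "yes", "yes", "long", "Aarskog"),
    ("yes", "hyper", "megalo", "yes", "yes", "long", "Aarskog"),
    ("yes", "hypo", "normal", "no", "no", "normal", "Down"),
    ("yes", "hyper", "normal", "no", "no", "normal", "Down"),
    ("yes", "hyper", "large", "yes", "yes", "long", "Aarskog"),
    ("no", "hyper", "megalo", "yes", "no", "long", "Cat-cry"),
]
COLUMNS = ("round", "telorism", "cornea", "slanting",
           "iris-defects", "eyelashes", "class")

def accuracy_and_coverage(rows, attribute, value, decision):
    """Return (alpha, kappa) for the rule [attribute = value] -> decision."""
    col = COLUMNS.index(attribute)
    support = {k for k, row in enumerate(rows) if row[col] == value}   # R_A
    target = {k for k, row in enumerate(rows) if row[-1] == decision}  # D
    both = support & target
    alpha = Fraction(len(both), len(support))  # |R_A and D| / |R_A|
    kappa = Fraction(len(both), len(target))   # |R_A and D| / |D|
    return alpha, kappa

alpha, kappa = accuracy_and_coverage(TABLE_1, "iris-defects", "yes", "Aarskog")
```

For `[iris-defects = yes]` both measures equal 1, in agreement with the rule above. The same call with `"eyelashes", "long"` gives accuracy 3/4 and coverage 1.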

When we apply the transforming rules shown in Figure 1 to the dataset of Table
1, the table is transformed into Table 2.


Table 2. A Small Database on Congenital Disorders (Transformed)

U  round  eye  eye  eye  eye  eye  class
1  no     no   yes  yes  yes  yes  Aarskog
2  yes    yes  yes  yes  yes  yes  Aarskog
3  yes    yes  no   no   no   no   Down
4  yes    yes  no   no   no   no   Down
5  yes    yes  yes  yes  yes  yes  Aarskog
6  no     yes  yes  yes  no   yes  Cat-cry

Definitions: eye: eye-symptoms


Then, by using transformation rule (1), the above rule is transformed into:

$$[\text{eye-symptoms} = \text{yes}] \rightarrow \text{Aarskog}.$$

It is notable that mutual exclusiveness of attributes has been lost by
transformation. Since five attributes (telorism, cornea, slanting, iris-defects and
eyelashes) are generalized into eye-symptoms, the candidates for accuracy and coverage
will be (5/6, 2/3), (3/4, 3/3), (3/4, 3/3), (3/3, 3/3), and (3/4, 3/3), respectively.
Then, we have to select which value is suitable for the context of this analysis.

In [11], one of the authors selected the minimum value in a medical context:
accuracy is equal to 3/4 and coverage is equal to 2/3.

Thus, the rewritten rule becomes the following probabilistic rule:

$$[\text{eye-symptoms} = \text{yes}] \rightarrow \text{Aarskog} \qquad \alpha = 3/4 = 0.75, \ \kappa = 2/3 = 0.67.$$
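
The selection step can be written as a small Python sketch. The candidate pairs are taken exactly as quoted in the text, one per generalized attribute; the dictionary layout and the function name `resolve` are hypothetical, and `min` plays the role of the context chosen for the medical domain.

```python
from fractions import Fraction

# Candidate (accuracy, coverage) pairs as quoted in the text.
candidates = {
    "telorism": (Fraction(5, 6), Fraction(2, 3)),
    "cornea": (Fraction(3, 4), Fraction(3, 3)),
    "slanting": (Fraction(3, 4), Fraction(3, 3)),
    "iris-defects": (Fraction(3, 3), Fraction(3, 3)),
    "eyelashes": (Fraction(3, 4), Fraction(3, 3)),
}

def resolve(pairs, operator=min):
    """Apply one operator to the accuracies and to the coverages separately."""
    pairs = list(pairs)
    accuracies = [a for a, _ in pairs]
    coverages = [k for _, k in pairs]
    return operator(accuracies), operator(coverages)

alpha, kappa = resolve(candidates.values(), min)
```

With `min` the result is accuracy 3/4 and coverage 2/3, the values used in the rewritten rule. Passing `max` instead would give 1 for both, which shows how strongly the outcome depends on the chosen context.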

These examples show that the loss of mutual exclusiveness is directly
connected to the emergence of fuzziness in a dataset. It is notable that the rule used
for transformation is a deterministic one. When this kind of transformation is
applied, whether the applied rule is deterministic or not, fuzziness will be observed.
However, no researchers have pointed out this problem with the combination of rule
induction and transformation.

It is also notable that the conflicts between attributes with respect to
accuracy and coverage correspond to the vector representation of membership
functions shown in Lin’s context-free fuzzy sets [4].
3 Multi-valued Attributes and Fuzziness

Another case of the violation of mutual exclusiveness is when experts use
multi-valued attributes, or a list, to describe some attributes in a database. This is
very common when we cannot predict the number of inputs for attributes.

For example, in a medical context, traffic accidents may injure several parts of
the body. Some patients have damage only on the hands and others suffer from
multiple injuries, which makes it difficult to fix the number of attributes. Even
if we enumerate all the possibilities of injuries and fix the number of columns
corresponding to the worst case, most of the patients may have only a small
number of them to be input. Usually, medical experts are not good at estimating
possible inputs, and they tend to make a list for data storage for the worst
cases, although the probability for such cases is very low. For example, if medical
experts empirically know that the number of injuries is at most 20, they will
set up 20 columns for input. However, if the average number of injuries is 4 or
5, all the remaining attributes will be stored as blank. Table 3 illustrates this
observation. Although these attributes look like missing values, they should not
be dealt with as missing values and have to be preprocessed: such large columns
should be transformed into binary ones. For the above example, each location of
injury will be appended as a column, and if that location is not described in a
list, then the value of that column should be set to 0.
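
The preprocessing step described above can be sketched in Python as follows. The records are a small hypothetical excerpt in the spirit of Table 3 (a list of fracture sites per patient), and the function name is an assumption made for the illustration.

```python
# Hypothetical multi-valued records: patient id -> list of fracture sites.
records = {
    1: ["arm", "finger", "shoulder"],
    2: ["foot"],
    3: ["arm"],
}

def to_binary_columns(records):
    """Turn each distinct list entry into a 0/1 column."""
    sites = sorted({site for values in records.values() for site in values})
    table = {
        key: {site: int(site in values) for site in sites}
        for key, values in records.items()
    }
    return sites, table

sites, table = to_binary_columns(records)
```

Each patient now has one column per fracture site, and a site that is absent from the patient's list is stored as 0 rather than as a blank.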


Table 3. A Small Database on Fracture

U   f        f       f         f       f        f       f    f       f        ...
1.  arm      finger  shoulder  -       -        -       -    -       -        ...
2.  foot     -       -         -       -        -       -    -       -        ...
3.  arm      -       -         -       -        -       -    -       -        ...
4.  rib      -       -         -       -        -       -    -       -        ...
5.  head     neck    shoulder  radius  ulnaris  finger  rib  pelvis  femoral  ...
6.  femoral  tibia   calneus   -       -        -       -    -       -        ...

Definitions: f: fracture.


It is easy to see that mutual exclusiveness of attributes is violated in this case.
Readers may say that if data is transformed into binary attributes then mutual
exclusiveness will be recovered. For example, if one of the above attribute-value
pairs [fracture = neck] is transformed into [neck fracture = yes] and others are
transformed in the same way, then the data table will be transformed into a
regular information table with binary attributes. It is a very good approach when
this attribute is a conditional one. But when a decision attribute is described as
a list, then it may be more difficult to deal with. For example, let us consider
the case shown in Table 4. Mutual exclusiveness of decision attributes does not
hold in this table. One solution is to construct new attributes represented by
the conjunction of several diseases for construction of a new partition of the
universe.³ However, when the number of attribute-value pairs is large, this
solution may be quite complicated. Also, the conjunction may not be applicable to
some domains.


Table 4. A Small Database on Bacterial Tests

U   Diseases          Diseases       Diseases
1.  Heart Failure     SLE            Renal Failure
2.  Pneumonia         -              -
3.  Pulmonary Emboli  -              -
4.  SLE               PSS            Renal Failure
5.  Liver Cirrhosis   Heart Failure  Hypertension

4 Functional Representation of Context-Free Fuzzy Sets

Lin has pointed out problems with multiple membership functions and
introduced relations between context-free fuzzy sets and information tables [4]. The
main contribution of context-free fuzzy sets to data mining is that information
tables can be used to represent multiple fuzzy membership functions. Usually
when we meet multiple membership functions, we have to resolve the conflicts
between functions. Lin argues that this resolution is bounded by the
context: minimum, maximum and other fuzzy operators can be viewed as a context. The
discussion in Section 2 illustrates Lin’s assertion. In particular, when we analyze
relational databases, transformation will be indispensable to data mining of
multiple tables. However, transformation may violate mutual exclusiveness of the target
information table. Then, multiple fuzzy membership functions will be observed.

Lin’s context-free fuzzy sets express such analyzing procedures as a simple
function, as shown in Figure 2. The important parts in this algorithm are the way
to construct a list of membership functions and the way to determine whether
this algorithm outputs a metalist of a list of membership functions or a list of
numerical values obtained by application of fuzzy operators to a list of membership
functions.


5  Conclusions 

This paper shows that mutual exclusiveness of conditional and decision
attributes does not always hold in real-world databases, where conventional
probabilistic approaches cannot be applied.


³ This idea is closely related to granular computation [3, 14].
procedure Resolution of Multiple Memberships;
var
    i : integer; L_a, L_i : List;
    A: a list of attribute-value pairs (multisets: bag);
    F: a list of fuzzy operators;
begin
    L_i := A;
    while (A ≠ {}) do
    begin
        [a_i = v_j](k) := first(A);
        Append μ([a_i = v_j](k)) to L_[a_i = v_j];
        /* L_[a_i = v_j]: a list of membership functions for attribute-value pairs */
        A := A − [a_i = v_j](k);
    end
    if (F = {}) then
        /* Context-Free */
        return all of the lists L_[a_i = v_j];
    else
        /* Resolution with Contexts */
        while (F ≠ {}) do
        begin
            f := first(F);
            Apply f to each L_[a_i = v_j];
            Output all of the membership functions μ_f([a_i = v_j]);
            F := F − f;
        end
end {Resolution of Multiple Memberships};


Fig. 2. Resolution of Multiple Fuzzy Memberships
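
The procedure of Fig. 2 can be approximated by the following Python sketch. It is one possible reading rather than the authors' implementation: membership degrees are assumed to be supplied as numbers alongside each attribute-value pair, fuzzy operators are assumed to be functions over a list of numbers, and all names are hypothetical.

```python
from collections import defaultdict

def resolve_memberships(pairs, operators=None):
    """Group membership degrees per attribute-value pair, then optionally
    resolve each group with every fuzzy operator supplied.

    pairs: iterable of ((attribute, value), degree), duplicates allowed (a bag).
    operators: optional mapping from an operator name to a function on lists.
    """
    lists = defaultdict(list)
    for key, degree in pairs:
        lists[key].append(degree)
    if not operators:
        # Context-free: return every list of membership degrees unchanged.
        return dict(lists)
    # Resolution with contexts: one resolved value per operator and pair.
    return {
        name: {key: op(values) for key, values in lists.items()}
        for name, op in operators.items()
    }

pairs = [
    (("eye-symptoms", "yes"), 0.75),
    (("eye-symptoms", "yes"), 1.0),
    (("eye-symptoms", "yes"), 0.75),
]
context_free = resolve_memberships(pairs)
with_context = resolve_memberships(pairs, {"min": min, "max": max})
```

Without operators the function returns the full list of degrees for each pair, which corresponds to the context-free output. With operators it returns one resolved value per context.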


It is surprising that transformation will easily generate this situation in data
mining from relational databases; when we apply attribute-oriented generalization
to attributes in databases, generalized attributes will have fuzziness for
classification. In this case, we have to take care of the conflicts between
attributes, which can be viewed as a problem with multiple membership
functions. Also, real-world databases may have fuzzy contexts when we store
multiple values for each attribute. It is notable that this phenomenon is quite
natural at least in the medical domain. Finally, the authors pointed out that these
contexts should be analyzed by using fuzzy techniques, where context-free fuzzy
sets will be a key idea to solve this problem. It will be our future work to induce
fuzzy if-then rules from this database and to compare these fuzzy rules with
other conventional approaches.
Abstract. Classical statistics and many data mining methods rely on “statistical
significance”  as  a  sole  criterion  for  evaluating  alternative  hypotheses.  In  this 
paper,  we  use  a  novel,  fuzzy  logic  approach  to  perform  hypothesis  testing.  The 
method  involves  four  major  steps:  hypothesis  formulation,  data  selection 
(sampling),  hypothesis  testing  (data  mining),  and  decision  (results).  In  the 
hypothesis  formulation  step,  a  null  hypothesis  and  set  of  alternative  hypotheses 
are  created  using  conjunctive  antecedents  and  consequent  functions.  In  the  data 
selection  step,  a  subset  D  of  the  set  of  all  data  in  the  database  is  chosen  as  a 
sample  set.  This  sample  should  contain  enough  objects  to  be  representative  of 
the  data  to  a  certain  degree  of  satisfaction.  In  the  third  step,  the  fuzzy 
implication  is  performed  for  the  data  in  D  for  each  hypothesis  and  the  results 
are  combined  using  some  aggregation  function.  These  results  are  used  in  the 
final  step  to  determine  if  the  null  hypothesis  should  be  accepted  or  rejected.  The 
method  is  applied  to  a  real-world  data  set  of  medical  diagnoses.  The  automated 
perception  approach  is  used  for  comparing  the  mapping  functions  of  fuzzy 
hypotheses,  tested  on  different  age  groups  (“young”  and  “old”).  The  results  are 
compared  to  the  “crisp”  hypothesis  testing. 


Keywords.  Hypothesis  testing,  fuzzy  set  theory,  data  mining,  knowledge 
discovery  in  databases,  approximate  reasoning. 
1 Introduction

The  analysis  of  medical  data  has  always  been  a  subject  of  considerable  interest  for 
governmental  institutions,  health  care  providers,  and  insurance  companies.  In  this 
study,  we  have  analyzed  a  data  set,  generously  provided  by  the  Computing  Division 
of  the  Israeli  Ministry  of  Health.  It  includes  the  demographic  data  and  medical 
diagnoses  (death  causes)  of  33,134  Israeli  citizens  who  passed  away  in  the  year  1993. 
The  file  does  not  contain  any  identifying  information  (like  names  or  personal  IDs). 

In  the  original  database,  the  medical  diagnosis  is  encoded  by  an  international,  6- 
digit  code  (ICD-9-CM).  The  code  provides  highly  detailed  information  on  the 
diseases:  the  1993  file  includes  1,248  distinct  codes.  Health  Ministry  officials  have 
grouped  these  codes  into  36  sets  of  the  most  common  death  causes,  based  on  the  first 
three  digits  of  the  code. 

It  is  a  well-known  fact  that  there  is  an  association  between  a  person’s  age  and  the 
likelihood  of  having  certain  diagnoses  (e.g.,  heart  diseases  are  more  frequent  among 
older  people).  Though  this  association  is  present  in  most  types  of  human  diseases  (and 
even  some  unnatural  causes  of  death),  it  is  not  necessarily  significant,  in  the  practical 
sense,  for  any  diagnosis.  Thus,  if  a  certain  disease  is  more  likely  by  only  2%  among 
people  over  the  age  of  40  than  among  younger  people,  this  can  hardly  have  any 
impact  on  the  Medicare  system.  Nevertheless,  if  the  last  fact  is  based  on  a  sufficiently 
large  sample,  its  statistical  significance  may  be  very  high. 
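
The gap between statistical and practical significance can be illustrated with a standard two-proportion z-test. The Python sketch below uses hypothetical counts (they are not taken from the Ministry of Health data): a difference of two percentage points between two groups of 20,000 people each.

```python
import math

def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts: 12% versus 10% prevalence, 20,000 people per group.
z, p_value = two_proportion_z_test(2400, 20000, 2000, 20000)
```

The test statistic is above 6 and the p-value is far below any conventional threshold, although the difference itself may be too small to matter in practice.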

Our  purpose  here  is  to  find  the  types  of  medical  diagnoses  where  the  difference 
between  young  people  and  elderly  people  is  practically  significant.  Once  these 
diagnoses  are  detected,  the  Ministry  of  Health  (like  any  other  health  care 
organization)  can  invest  a  larger  part  of  its  budget  in  preventing  the  related  death 
causes  in  certain  age  groups  of  the  population.  Thus,  for  every  possible  cause  (e.g., 
cancer,  heart  disease,  or  traffic  accident)  we  are  testing  a  single  hypothesis  saying, 
“The  elderly  people  are  more  (less)  likely  to  die  from  this  cause  than  the  young 
people.”  Since  the  number  of  available  hypotheses  is  strongly  limited  (the  ministry 
officials  have  identified  36  sets  of  major  causes),  each  hypothesis  will  be  tested  by  a 
verification-oriented approach. For a concise comparison between
verification-oriented and discovery-oriented methods of data mining, see Fayyad et al. [1].

This  paper  is  organized  as  follows.  In  the  next  section,  we  describe  a  “crisp” 
approach  to  hypothesis  testing,  aimed  at  measuring  the  statistical  significance  of  each 
hypothesis.  The  limitations  of  applying  this  “classical”  statistical  approach  to  real- 
world  problems  of  data  analysis  are  clearly  emphasized.  In  Section  3,  we  proceed 
with  representing  a  novel  methodology  of  fuzzy  hypothesis  testing  for  verification- 
based  data  mining.  The  analysis  of  the  medical  data  set  by  using  the  “crisp”  approach 
and  the  fuzzy  approach  to  hypothesis  testing  is  performed  in  Section  4.  Section  5 
concludes  the  paper  by  comparing  the  results  of  the  two  methods  and  outlining  other 
potential  applications  of  the  Fuzzy  Set  Theory  to  the  area  of  data  mining. 
2 “Crisp” Hypothesis Testing

Statistical  hypothesis  testing  is  a  process  of  indirect  proof  [6].  This  is  because  the  data 
analyst  assumes  a  single  hypothesis  (usually  called  the  null  hypothesis)  about  the 
underlying  phenomenon  to  be  true.  In  the  case  of  medical  data,  the  simplest  null 
hypothesis  may  be  that  the  likelihood  of  people  under  40  having  heart  disease  is  equal 
to  the  likelihood  of  people  over  40.  The  objective  of  a  statistical  test  is  to  verify  the 
null  hypothesis.  The  test  has  a  “crisp”  outcome:  the  null  hypothesis  is  either  rejected 
or  retained  (see  [6]).  According  to  the  statistical  theory,  retaining  the  null  hypothesis 
should  not  be  interpreted  as  accepting  that  hypothesis.  Retaining  just  means  that  we 
do  not  have  sufficient  statistical  evidence  that  the  null  hypothesis  is  not  true.  On  the 
other  hand,  rejecting  the  null  hypothesis  implies  that  there  are  an  infinite  number  of 
alternative  hypotheses,  one  of  them  being  true.  In  our  example,  the  set  of  alternative 
hypotheses  includes  all  non-zero  differences  between  the  probabilities  of  the  same 
disease  in  the  two  distinct  population  groups. 

The  statistical  theory  of  hypothesis  testing  deals  with  a  major  problem  of  any  data 
analysis:  the  limited  availability  of  target  data.  In  many  cases,  it  is  either  impossible 
or  too  expensive  to  collect  information  about  all  the  relevant  data  items.  Hence,  a 
random  sample,  selected  from  the  entire  population,  is  frequently  used  for  testing  the 
null  hypothesis.  In  the  random  sample,  like  in  the  entire  population,  we  may  find 
some  evidence  contradicting  the  statement  of  the  null  hypothesis.  This  does  not 
necessarily  mean  that  the  null  hypothesis  is  wrong:  the  real  data  is  usually  affected  by 
many  random  factors,  known  as  noise.  Representing  the  distribution  of  noise  in  the 
sample  cases  is  an  integral  part  of  the  null  hypothesis.  Thus,  for  comparing  means  of 
continuous  variables  derived  from  large  samples,  the  assumption  of  a  Normal 
distribution  (based  on  the  Central  Limit  Theorem)  is  frequently  used. 

To compare the probabilities of a diagnosis in two distinct age groups, we
need to perform the comparison between proportions test (see [5]). This test is based
on  two  independent  random  samples,  extracted  from  two  populations.  The  sizes  of  the 
samples  do  not  have  to  be  equal,  but  to  apply  the  Central  Limit  Theorem,  each  sample 
should  include  at  least  30  cases.  Furthermore,  we  assume  that  each  person  in  the  same 
age  group  has  exactly  the  same  probability  of  having  a  certain  disease.  The  last 
assumption  enables  us  to  describe  the  actual  number  of  “positive”  and  “negative” 
cases  in  each  group  by  using  the  Binomial  distribution. 
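The comparison between proportions test described above can be sketched as follows. This is a minimal illustration, assuming the common pooled-variance form of the two-sample z-test under the Normal approximation; the function name `two_proportion_z` and the example counts are hypothetical and do not come from the medical dataset.

```python
import math


def two_proportion_z(x1, n1, x2, n2):
    """Return (z, two-sided p-value) for the null hypothesis p1 == p2.

    x1, x2: number of "positive" cases in each independent sample.
    n1, n2: sample sizes (each should include at least 30 cases).
    Assumption: the pooled estimate of the common proportion is used.
    """
    if min(n1, n2) < 30:
        raise ValueError("each sample should include at least 30 cases")
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n1 + 1.0 / n2))
    if se == 0.0:
        # No variation at all in either sample: nothing contradicts H0.
        return 0.0, 1.0
    z = (p1 - p2) / se
    # Two-sided tail probability of the standard Normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


# The same one-point difference (10% vs. 11%) at two sample sizes.
z_small, p_small = two_proportion_z(10, 100, 11, 100)
z_large, p_large = two_proportion_z(10_000, 100_000, 11_000, 100_000)
```

With 100 cases per group the difference is far from significant, while with 100,000 cases per group the same difference is rejected at any conventional level. The outcome is driven by the sample size rather than by the practical importance of the difference.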

The  massive  use  of  the  “crisp”  hypothesis  testing  by  many  generations  of 
statisticians  has  not  eliminated  the  confusion  associated  with  its  practical  application. 
Retaining  a  hypothesis  is  supposed  to  increase  our  belief  in  it  -  but  how  much  greater 
should  our  belief  be  now?  Statistics  gives  no  clear  answer.  Rejecting  a  hypothesis 
leaves  us  even  more  confused:  we  are  not  supposed  to  believe  in  the  null  hypothesis 
anymore.  However,  which  alternative  hypothesis  should  be  considered  true? 

Apparently,  the  significance  level  may  be  used  as  a  continuous  measure  of 
evaluating  hypotheses.  However,  as  indicated  by  [6],  “significant”  is  a  purely 
technical  term  and  it  should  not  be  confused  with  the  practical  terms  “important,” 
“substantial,”  “meaningful,”  etc.  Very  large  samples  may  lead  us  to  statistically 
significant  conclusions,  based  on  negligible  differences  between  estimates.  In  other 
words,  statistical  significance  does  not  imply  practical  significance.  In  the  next 
section,  we  describe  a  novel,  fuzzy  method  for  determining  the  validity  of  a 
hypothesis  on  a  continuous  scale. 
3 Fuzzy Hypothesis Testing

The  concept  of  fuzzy  testing,  or  more  specifically,  fuzzy  hypothesis  testing  [7]  is  a 
verification-based  method  of  data  mining.  A  fuzzy  hypothesis  test  is  used  to 
determine  the  truth  (or  falsity)  of  a  proposed  hypothesis.  The  hypothesis  may  involve 
either  crisp  or  fuzzy  data;  however,  a  fuzzy  hypothesis  test  should  produce  a  value  on 
[0,1],  which  indicates  the  degree  to  which  the  hypothesis  is  valid  for  given  sample 
data.  This  is  an  extension  of  the  classical  hypothesis  test,  which  yields  a  crisp  value  in 
{0,1} (see above). The fuzzy hypothesis test will accept the null hypothesis $H_0$ to
some degree $\mu$ and the alternative hypothesis $H_1$ to some degree $1-\mu$.


3.1  The  Formal  Notation 

A set of collected data, i.e. a database, is defined:

$$X = \{x_1, x_2, x_3, \ldots, x_m\}$$

where $m$ is the number of cases (records) in the database and $x_i$ is an $n$-dimensional
vector in an $n$-dimensional feature space:

$$x_i = (x_{i1}, x_{i2}, \ldots, x_{in})$$

A set $D \subset X$ is chosen, called a sample set, which will be used to test the hypothesis.
Next, choose a set of hypotheses $H = \{H_0, H_1, \ldots, H_f\}$ where $H_0$ is the null hypothesis to
accept or reject and $H_1$ through $H_f$ are the alternate hypotheses we must accept if we
reject $H_0$. A hypothesis can be thought of as an implication of the form:

if condition$_1$ and condition$_2$ and ... condition$_k$
then $x_i$ is a member of $F$ with membership $\mu(x_i)$

In other words, a hypothesis is composed of a set $C$ of $k$ conjunctive antecedent
conditions and a consequent classification (e.g. cluster, fuzzy set) $F$. A condition is a
comparison of one of the components of $x_i$ and a constant (possibly fuzzy) value. $\mu$ is
defined as a mapping: $\mu(x_i, H) \rightarrow [0,1]$.

In  the  medical  dataset,  examples  of  conditions  include: 

•  “A  person  lives  in  the  city  of  Haifa”  (a  crisp  condition) 

•  “A  person  is  old”  (a  fuzzy  condition) 

The value of $\mu$ determines whether the data collected agrees with the hypothesis. A
value of $\mu_0 = 1$ means the data is in total agreement with the null hypothesis; a value of
$\mu_0 = 0$ means the data totally contradicts the null hypothesis. Additionally, the value of
$\mu$ for the alternative hypotheses should be the inverse of that of $H_0$, i.e. $\mu_1 + \mu_2 + \ldots + \mu_f = 1 - \mu_0$.


3.2  Calculating  the  Sample  Size 

Since  it  may  not  always  be  practical  or  possible  to  use  all  collected  data  (i.e.  the  entire 
database),  a  sampling  of  data,  called  a  sample  set,  is  used  to  verify  the  hypotheses. 
The sample set D is usually chosen at random from among the set X (the entire
database). This random sampling must be large enough to make sure that the set D is
“good”; i.e. that D reflects the contents of X. If D = X it must be accepted; the sample
is the entire database. If D = ∅, it must be rejected; the sample contains no data.
Otherwise, the number of data in D, denoted d = |D|, will determine if it is “good.”

The  following  function,  called  the  degree  of  satisfaction  (DoS),  is  chosen  to 
represent  the  belief  that  D  is  a  good  sample  of  X  based  on  d  (the  sample  size)  and  m 
(the  size  of  the  entire  data  set): 


$$f(d,m) = \begin{cases} \dfrac{\log(d/m)}{\log(b)} + 1 & \text{when } d > \dfrac{m}{b} \\ 0 & \text{otherwise} \end{cases} \qquad (1)$$


where b is a constant that controls the x-intercept of the function (the sample size of
zero satisfaction). Larger values of b make the intercept closer to 0. For example,
when b=10, the x-intercept is at 10% of m (10% of the items are guaranteed to be
selected); for b=100, the x-intercept is 1% of m (the minimal sample size is 1%).
Figure 1 shows the graph of the function f for b=10. In the graph, the x-axis is the
percentage of the total data, m, selected for D. In other words, the x-axis is d/m, where
0 is 0% and 1.0 is 100%. The function f is used to select the number of items for D: as
f(d,m) → 1, d → m. Thus, the sample becomes closer to 100% for higher degrees of
satisfaction required.

The  function  is  chosen  as  it  meets  the  criteria  given  above  for  selecting  the  size  of 
a "good" sample set. If d=m (i.e. the entire database), then f=1.0. If d=0 (i.e. no data),
then f=0.0. The introduction of variable b allows us to set a stronger condition of
f=0.0  when  d  <  m/b,  if  we  have  a  preference  that  there  should  be  some  lower  limit  on 
the  number  of  items  selected  for  the  sample.  We  chose  the  logarithm  function  because 
of  its  shape.  From  the  figure  we  see  that  as  we  add  items  to  the  sample,  the  function  f 
increases  faster  at  the  beginning  than  later,  when  the  sample  set  is  larger.  This  agrees 
intuitively  with  our  notion  of  how  a  sample  works:  more  items  are  generally  better, 
but  once  we  have  a  certain  amount  of  items  in  our  sample  the  additional  information 
provided  by  adding  more  items  is  less  than  that  of  adding  the  same  number  of  items  to 
a  smaller  sample. 
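The degree of satisfaction and its use for choosing a sample size can be sketched as below. This is an illustrative reading of function (1), not the authors' implementation: the names `degree_of_satisfaction` and `sample_size_for` are hypothetical, and rounding the inverted value up to the next integer is an assumption.

```python
import math


def degree_of_satisfaction(d, m, b=10.0):
    """f(d, m) = log(d/m)/log(b) + 1 when d > m/b, and 0 otherwise."""
    if d <= m / b:
        return 0.0
    return math.log(d / m) / math.log(b) + 1.0


def sample_size_for(dos, m, b=10.0):
    """Sample size needed for a required degree of satisfaction.

    Inverting f(d, m) = dos gives d = m * b**(dos - 1).
    Assumption: the result is rounded up to a whole number of records.
    """
    return math.ceil(m * b ** (dos - 1.0))


# Settings reported for the medical data: DoS = 0.90, b = 100, m = 33,134.
d = sample_size_for(0.90, 33_134, b=100.0)
```

Under these assumptions the inverted function yields 20,907 records, which agrees with the sample size reported for the medical dataset in Section 4.1.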


Fig. 1. Plot of function f (b = 10)


As  shown  above,  the  fuzzy  method  of  calculating  the  sample  size  does  not  depend 
on  the  hypotheses  we  are  going  to  test  on  the  data.  This  approach  agrees  with  the 
common process of knowledge discovery in databases (see [1]), where the target data
is  selected  before  the  data  mining  stage.  The  procedure  for  selecting  an  appropriate 
sample  size,  suggested  by  the  statisticians  (see  [6]),  is  more  complicated  and  it 
assumes  knowing  in  advance  both  the  hypotheses  to  be  tested  and  the  underlying 
distributions.  According  to  [6],  the  first  step  is  specifying  the  minimum  effect  that  is 
“important”  to  be  detected  by  the  hypothesis  testing.  The  linguistic  concept  of 
importance  is  certainly  beyond  the  scope  of  the  statistical  inference.  However,  it  is 
directly  related  to  the  process  of  approximate  reasoning,  easily  represented  by  the 
Fuzzy  Set  Theory  (see  [2]). 


3.3  Creating  the  Mapping  Function  for  Each  Hypothesis 

The mapping function $\mu_i$ maps each vector in D for a given hypothesis $H_i$ to a value in
[0,1]. This number represents the degree to which each sample agrees with the
hypothesis. In order to determine the agreement, the membership function of the
consequent $F_i$ must be known. If the data described by the vector x lies within $F_i$, then
$\mu_i$ should equal the degree of membership of x in $F_i$. Usually $F_i$ will be some
geometric function on [0,1], such as a triangular or trapezoidal shaped function.

The vectors in D are compared with the conjunctive conditions in the antecedent of
the hypothesis. For crisp conditions, any false condition causes x to be
excluded from consideration, since such a vector lends no support to the null hypothesis
or alternative hypotheses. For fuzzy conditions, it may be necessary to use some
threshold value to determine if the vector x should be excluded. For example, for a
fuzzy value of 0.5 or less, the vector x may be closer to some other fuzzy set. Each
fuzzy condition in the antecedent will have a value on [0,1] for each x, and these
values must be combined using a t-norm operation, such as min. The resulting value
indicates the degree to which x supports the antecedent conditions of $H_i$. The
Dienes-Rescher fuzzy implication [8] is then performed for the combined antecedent values
and the consequent value:


$$\mu_i = \max(1 - P_i, f_i) \qquad (2)$$


where $P$ is the value of the combined antecedents and $f$ is a function describing the
fuzzy membership of the consequent. Here the subscript $i$ denotes to which hypothesis
each variable belongs; it will range from 0 (the null hypothesis) to k, for k alternative
hypotheses. Thus, $P_2$ would be the antecedents for hypothesis $H_2$, $f_3$ would be the
fuzzy membership of the consequent for hypothesis $H_3$, etc.
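The per-vector computation can be sketched as follows: antecedent values are combined with the min t-norm, vectors at or below a threshold are excluded, and the Dienes-Rescher implication max(1 - P, f) gives the agreement of the vector with the hypothesis. The function names and the default threshold of 0.5 are illustrative assumptions (the text mentions 0.5 only as an example).

```python
def antecedent_degree(condition_values, threshold=0.5):
    """Combine antecedent condition values with the min t-norm.

    condition_values: non-empty list of values in [0, 1]; crisp conditions
    contribute 0.0 (false) or 1.0 (true).
    Returns None when the vector is excluded from consideration, i.e. the
    combined value does not exceed the (assumed) threshold.
    """
    p = min(condition_values)
    return p if p > threshold else None


def dienes_rescher(p, f):
    """Dienes-Rescher implication: agreement of one vector with a hypothesis.

    p: combined antecedent value, f: membership of the vector in the consequent.
    """
    return max(1.0 - p, f)


# A vector satisfying a crisp condition (1.0) and a fuzzy one to degree 0.8,
# whose membership in the consequent is 0.3.
p = antecedent_degree([1.0, 0.8])
mu = dienes_rescher(p, 0.3)
```

A vector that fails any crisp condition gets a combined value of 0 and is therefore dropped before the implication is evaluated.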

Once the membership $\mu_0$ for each x in D is determined, the values must be
aggregated to determine if the values in D, taken as a whole, support $H_0$. This can be
done in a variety of ways including arithmetic mean (each point contributes to the
decision), minimum (pessimistic - if any x fail $H_0$, then $H_0$ is rejected), or maximum
(optimistic - if any x pass $H_0$, then $H_0$ is accepted). For arithmetic mean, denote the
overall mapping function $M_k$ for hypothesis k:

$$M_k(D) = \frac{1}{\delta} \sum_{i=1}^{\delta} \mu_k(x_i) \qquad (3)$$

where $\delta$ is the number of vectors in D that are relevant to the hypothesis under
consideration.
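The three aggregation choices can be sketched as below. This is a minimal illustration: the function name `aggregate` is hypothetical, and representing an excluded (irrelevant) vector by `None` is an assumption made here so that only the relevant vectors are counted.

```python
def aggregate(memberships, how="mean"):
    """Aggregate per-vector memberships into one value for the sample D.

    memberships: values in [0, 1], or None for vectors that are not
    relevant to the hypothesis (these are skipped).
    how: "mean" (every point contributes), "min" (pessimistic),
         or "max" (optimistic).
    """
    relevant = [mu for mu in memberships if mu is not None]
    if not relevant:
        raise ValueError("no vector in the sample is relevant to the hypothesis")
    if how == "mean":
        return sum(relevant) / len(relevant)
    if how == "min":
        return min(relevant)
    if how == "max":
        return max(relevant)
    raise ValueError(f"unknown aggregation: {how!r}")


values = [0.2, None, 0.8, 0.5]  # one vector was excluded
overall_mean = aggregate(values, "mean")
overall_min = aggregate(values, "min")
overall_max = aggregate(values, "max")
```

The mean divides by the number of relevant vectors only, so excluded vectors neither support nor weaken the hypothesis.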


3.4  Comparing  Fuzzy  Hypotheses 

In  the  medical  database,  our  objective  is  to  compare  between  the  overall  mapping 
functions  of  two  hypotheses: 

• Hypothesis No. 1: If the age is young, then diagnosis (cause) = x

•  Hypothesis  No.  2:  If  the  age  is  old,  then  diagnosis  (cause)  =  x 

If  the  second  mapping  function  is  significantly  greater  (or  significantly  smaller) 
than  the  first  one,  then  we  can  conclude  that  older  people  have  a  higher  (or  a  lower) 
likelihood  of  having  that  diagnosis  than  young  people.  “Significantly  greater 
(smaller)”  are  fuzzy  terms  depending  on  human  perception  of  the  difference  between 
the  mapping  functions.  We  have  outlined  a  general  approach  to  automated  perception 
in  [3-4].  For  automating  the  perception  of  this  difference,  we  are  using  here  the 
following  membership  function 

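$$\mu_{\mathrm{sig}}(D) = \begin{cases} \dfrac{1}{1 + \exp\left(\beta \, (M_1(D) - M_2(D))\right)}, & M_2(D) > M_1(D) \\ \dfrac{1}{1 + \exp\left(\beta \, (M_2(D) - M_1(D))\right)}, & \text{otherwise} \end{cases} \qquad (4)$$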


where $\beta$ is an adjustable coefficient representing the human confidence in the
difference between frequencies, based on a given sample size. The membership
function increases with the value of $\beta$.
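A small sketch of this membership function follows. It assumes that both branches are logistic in the coefficient times the difference between the two overall mapping functions, so that the value always lies in [0.5, 1). The function name `difference_significance` is hypothetical, and feeding raw proportions in place of the mapping functions is a simplification made only for this example.

```python
import math


def difference_significance(m1, m2, beta):
    """Degree to which m2 differs "significantly" from m1, in [0.5, 1).

    m1, m2: overall mapping functions of the two hypotheses.
    beta: adjustable confidence coefficient (larger for larger samples).
    Both branches of the membership function reduce to a logistic curve
    in beta * |m2 - m1|.
    """
    return 1.0 / (1.0 + math.exp(-beta * abs(m2 - m1)))


# Diabetes proportions reported in Section 4.2 (0.28% vs. 2.77%), beta = 25.
mu_diabetes = difference_significance(0.0028, 0.0277, beta=25.0)
```

Under these assumptions the result is approximately 0.65, in line with the fuzzy significance reported for diabetes in Section 4.2. Identical mapping functions give exactly 0.5, the lowest possible value.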


4  Analysis  of  the  Medical  Data 


4.1  Hypothesis  Testing 

In  order  to  create  the  mapping  functions  for  each  fuzzy  hypothesis,  the  fuzzy  sets 
corresponding  to  “young  age”  and  “old  age”  have  been  determined.  These  fuzzy  sets 
are  shown  in  Fig.  2.  Both  sets  are  represented  by  triangular  membership  functions. 
The  definition  of  these  membership  functions  is  completely  subjective  and  user- 
dependent. 
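A triangular membership function of the kind used for “young age” and “old age” can be sketched as follows. The breakpoints below are purely hypothetical placeholders chosen for illustration; the actual values are those plotted in Fig. 2 and are not reproduced here.

```python
def triangular(x, a, b, c):
    """Membership of x in a triangular fuzzy set with feet a, c and peak b.

    Requires a <= b <= c. When a == b or b == c the triangle degenerates
    into a one-sided (shoulder) shape, which is still handled.
    """
    if x == b:
        return 1.0
    if x <= a or x >= c:
        return 0.0
    if x < b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)


# Hypothetical breakpoints (in years), NOT taken from the paper.
def young(age):
    return triangular(age, 0.0, 0.0, 60.0)


def old(age):
    return triangular(age, 30.0, 100.0, 100.0)
```

With these placeholder sets a 45-year-old belongs to “young” with degree 0.25 and to “old” with degree of about 0.21, whereas the crisp threshold assigns the same person entirely to one group.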

To  perform  an  objective  comparison  between  the  fuzzy  hypothesis  testing  and  the 
“crisp”  approach,  we  have  used  the  threshold  of  45  years  to  divide  the  records  into 
“young”  and  “old”  people.  Afterwards,  the  proportion  of  each  diagnosis  under  the  age 
of 45 has been compared statistically to the proportion of the same diagnosis for
people  over  45  years  old.  The  statistical  significance  of  the  difference  between 
proportions  has  been  evaluated  by  the  comparison  between  proportions  test  (see 
Section  2  above). 

Both  methods  of  hypothesis  testing  have  been  applied  to  the  same  random  sample. 
The  sample  size  has  been  determined  by  the  fuzzy  method  of  Section  3.2  above,  using 
DoS  (Degree  of  Satisfaction)  equal  to  0.90  and  the  constant  b  =  100.  The  number  of 
records obtained is 20,907 (out of 33,134), including 1,908 young people and 18,998
elderly  people.  For  comparing  fuzzy  hypotheses,  based  on  this  sample  size,  the 
coefficient  p  =  25  has  been  selected. 


Fig.  2.  Fuzzy  sets  “young  age”  and  “old  age” 


4.2  Summary  of  Results 

The  36  diagnoses  present  in  the  medical  dataset  can  be  divided  into  the  following 
categories,  by  the  effect  of  person  age: 

•  Five  diagnoses  (death  causes),  where  the  difference  between  the  young  people 
and  the  elderly  people  is  highly  significant  according  to  both  the  fuzzy  test  and  the 
“crisp”  test.  These  causes  include:  Ischaemic  Heart  Disease,  Cerebrovascular 
Disease,  Diseases  of  Pulmonary  Circulation,  Motor  Vehicle  Traffic  Accidents,  and 
Other  Accidents.  The  likelihood  increases  with  the  age  for  the  first  three  causes 
and  decreases  for  the  last  two.  From  the  viewpoint  of  the  health  care  system,  this 
means  that  older  people  have  a  higher  risk  of  dying  from  the  first  three  diseases. 
Consequently,  this  age  group  should  be  subject  to  frequent  medical  assessments  as 
a  preventive  treatment.  To  decrease  the  number  of  traffic  and  other  accidents  in  the 
young  age  group,  some  restrictions  may  be  applied  (and  are  actually  applied)  with 
respect  to  young  drivers. 

•  Nineteen  diagnoses,  where  the  statistical  significance  of  the  difference  is  also 
very  high  (over  99.9%),  but  the  fuzzy  test  has  shown  a  relatively  low  significance 
varying  from  0.50  to  0.78.  For  example,  only  0.28%  of  young  people  have  diabetes 
vs.  2.77%  of  elderly  people.  The  significance  of  the  fuzzy  test  in  this  case  is  only 
0.65.  However,  the  statistical  test  of  comparison  between  proportions  has  provided 
us  with  a  statistic  z  =  10.44,  which  has  a  very  high  significance  (almost  1.00). 

•  Eleven  diagnoses,  where  the  significance  of  both  tests  is  relatively  low. 

•  One  rare  diagnosis,  which  has  been  completely  omitted  from  the  random  sample. 
5 Conclusions

The results of the hypothesis testing presented in Section 4 above emphasize the
main drawback of statistical methods: statistical significance should not be used
as a synonym for importance. Relying solely on the results of the “crisp” testing in
the above dataset would lead (actually, mislead) the analysts into concluding that
almost all death causes have a strong association with age. This could lead to a
wrong setting of health care priorities, or even to ignoring age completely for this
purpose. The main contribution of Fuzzy Set Theory to this problem is the improved
differentiation  of  diagnoses,  starting  with  those  completely  unaffected  by  age,  and 
ending  with  the  five  causes  (see  sub-section  4.2  above)  where  the  age  is  the  leading 
factor. 

As  we  have  shown  in  our  work  on  automated  perceptions  [3],  the  potential  benefit 
of  applying  fuzzy  logic  methods  to  data  mining  is  yet  to  be  studied.  After  solving  one 
limitation  of  the  traditional  data  analysis,  moving  from  verification  of  hypotheses  to 
their  discovery,  many  data  mining  methods  are  still  anchored  to  the  statistical  methods 
of  significance  testing.  Consequently,  a  lot  of  unimportant  (mostly,  random) 
hypotheses are “discovered” in data. The fuzzy hypothesis testing is challenging this
problem.
Abstract. Fuzzy logic has not been applied to macro-economic modelling
despite  advantages  this  technique  has  over  mathematical  and  statistical 
techniques  more  commonly  used.  The  use  of  fuzzy  logic  provides  a  technique 
for  modelling  that  makes  none  of  the  theoretical  assumptions  normally  made  in 
macroeconomics.  However,  in  order  to  avoid  making  assumptions,  we  need  to 
elicit  fuzzy  rules  directly  from  the  data.  This  is  done  using  a  genetic  algorithm 
search  for  rules  that  fit  the  data.  The  technique  discovered  rules  from  artificially 
generated  data  that  was  consistent  with  the  function  used  to  generate  the  data. 
The  technique  was  used  to  discover  rules  that  predict  changes  to  national 
consumption  in  order  to  explore  the  veracity  of  two  economic  theories  that 
propose  different  causes  for  changes  in  consumption.  The  fuzzy  rules  generated 
illustrate  a  more  fine-grained  analysis  of  consumption  than  is  predicted  by 
either  theory  alone.  Predictions  made  using  the  generated  rules  were  more 
accurate  following  ten-fold  cross  validation  than  those  made  by  a  neural 
network  and  a  simple  linear  regression  model  on  the  same  data. 


Introduction 

Macro-economic  modelling  and  forecasting  has  traditionally  been  performed  with  the 
exclusive  use  of  mathematical  and  statistical  tools.  However,  these  tools  are  not 
always  appropriate  for  economic  modelling  because  of  uncertainty  associated  with 
decision  making  by  humans  in  an  economy.  The  development  of  any  economy  is 
determined  by  a  wide  range  of  activities  performed  by  humans  as  householders, 
managers,  or  government  policy  makers.  Persons  in  each  role  pursue  different  goals 
and,  more  importantly,  base  their  economic  plans  on  decision-making  in  vague  and 
often  ambiguous  terms.  For  example,  a  householder  may  make  a  decision  on  the 
proportion of income to reserve as savings according to the rule: {IF my future salary
is  likely  to  diminish,  THEN  I  will  save  a  greater  proportion  of  my  current  salary}. 
Mathematical  models  of  human  decision-making  impose  precise  forms  of  continuous 
functions  and  overlook  the  inherent  fuzziness  of  the  process. 

In  addition  to  imposing  a  crispness  that  may  not  be  appropriate,  mathematical  and 
statistical  models  necessarily  make  assumptions  that  derive  from  economic  theories. 
A  large  variety  of  sometimes  conflicting  models  have  emerged  over  the  years  as  a 
consequence of this. Inferences drawn from a model hold only to the extent that the
economic theoretical assumptions hold, yet this is often difficult to determine.

Macroeconomic  researchers  solely  using  mathematical  or  statistical  models  are 
compelled  to  make  assumptions  based  on  their  own  subjective  view  of  the  world  or 
theoretical  background  and  beliefs.  For  example,  hypotheses  generated  by  researchers 
who  accept  Keynesian  assumptions  are  quite  different  from  hypotheses  from  Classical 
theorists.  Hypotheses  are  not  only  dependent  upon  the  subjective  beliefs  of  their 
creators  but  can  easily  become  obsolete.  Completely  different  economic  systems  can 
rise  in  different  times  in  different  countries  and  be  described  by  different  models. 
Thus,  if  making  assumptions  and  deriving  hypotheses  about  an  economy  leads  to 
subjective  models,  and  successful  theories  do  not  last  long,  then  the  following 
questions arise: Is it possible to eliminate a model's dependence on the researcher's
subjective assumptions about features and properties of the object of study? Can
there exist an approach that automatically generates a hypothetical basis for
constructing a model? Can this approach be applied in different times to different
types of economic systems?

In  this  paper  we  introduce  a  modelling  approach  that  does  not  rely  on  theoretical 
assumptions or subjective fine-tuning of system parameters. We apply fuzzy theory
and  use  an  evolutionary  programming  approach  to  pursue  two  goals: 

1. To provide a user with a system that better represents uncertainty caused by the
prevalence of human decision making in an economy.

2.  To  build  a  forecasting  model  without  any  initial  assumptions,  which  aims  solely  to 
be  consistent  with  observed  economic  data. 

Our  approach  derives  fuzzy  rules  from  macro-economic  data.  We  use  an 
evolutionary  programming  approach  to  search  for  rules  that  best  fit  the  data.  A  user 
can  glance  at  the  rules  and  visualise  the  main  dependencies  and  trends  between 
variables.  Moreover,  if  there  are  exogenous  variables  in  the  model  presented  among 
input  indicators,  a  user  is  able  to  foresee  a  possible  impact  of  their  simulations  on 
other variables of interest. For example, quickly glancing at the fuzzy rules shown in
Table 1, we can say that in order to obtain large values of the output we need to
increase the value of exogenous variable x2.


           X1
X2         SMALL    LARGE
SMALL      SMALL    SMALL
LARGE      LARGE    LARGE

Table 1. Sample fuzzy rules table

We believe that fuzzy logic, though not normally used in macro-economic
modelling, is suitable for capturing the uncertainty inherent in the problem domain. An
evolutionary  approach  to  building  the  system  can  facilitate  the  design  of  a  system  free 
of  subjective  assumptions,  and  based  only  on  patterns  in  the  data. 

In  the  following  section  we  describe  the  concept  of  hybrid  fuzzy  logic  and  genetic 
algorithms.  Following  that  we  describe  our  method  in  some  detail  and  provide  an 
example  with  data  that  is  generated  from  artificial  rules.  In  Section  5  we  apply  the 
method  to  macro-economic  data  before  outlining  future  directions. 



Mining  forecasting  fuzzy  rules  with  genetic  algorithms 

Our  task  is  to  model  a  macro-economic  environment  and  capture  any  uncertainty  in  a 
macro-economic  agent’s  decision-making  behaviour  in  order  to  generate  predictions 
of  the  economic  system’s  development  in  the  future.  We  are  required  to  mine 
knowledge  of  this  process  in  flexible  human-like  terms.  The  application  of  the  fuzzy 
control  architecture  for  this  forecasting  problem  proceeds  with  the  following 
modifications: 

1.  Macro-economic  data  is  sourced  from  national  repositories. 

2. The fuzzy sets and defuzzification methods are set as parameter features of the
system. No attempt is made to automatically discover membership functions.

3.  The  rules  governing  the  process  are  required  to  be  discovered  from  the  data. 

Research  in  fuzzy  control  focuses  on  the  discovery  of  membership  functions,  and 
independently  on  the  fine  tuning  of  fuzzy  rules  for  given  data.  Both  research  strands 
are  aimed  at  adjusting  a  fuzzy  control  system  to  the  specific  data.  Several  researchers 
[1],  [3]  used  genetic  algorithms  to  simultaneously  find  fuzzy  rules  and  parameters  of 
the membership functions. However, the simultaneous search for rules and
membership  functions  adds  complexity  and  may  not  be  necessary  if  we  are  dealing 
with  economic  data.  With  most  economic  indicators  there  is  general  agreement  about 
the  mapping  of  qualitative  terms  onto  quantitative  terms.  Most  economists  would 
regard  a  gross  domestic  product  (GDP)  rise  of  1%  to  be  low,  one  of  5%  to  be  high. 
There  may  be  some  disagreement  surrounding  a  GDP  value  of  3%  but  there  is 
expected  to  be  little  disagreement  about  the  precise  form  of  the  function  between  high 
and  low. 
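As an illustration of such a pre-set membership function, the sketch below maps a GDP growth rate onto the terms “low” and “high”. The anchor points of 1% and 5% come from the text; the linear shape between them and the function names are assumptions made only for this example.

```python
def gdp_growth_high(rate, low=1.0, high=5.0):
    """Membership of a GDP growth rate (in percent) in the term "high".

    Assumption: membership rises linearly from 0 at `low` to 1 at `high`.
    """
    if rate <= low:
        return 0.0
    if rate >= high:
        return 1.0
    return (rate - low) / (high - low)


def gdp_growth_low(rate, low=1.0, high=5.0):
    """Membership in the term "low", taken as the complement of "high"."""
    return 1.0 - gdp_growth_high(rate, low, high)
```

With this shape a rise of 3% belongs to both terms with degree 0.5, which reflects the disagreement that the text expects around that value.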

In  order  to  reduce  the  complexity  of  the  search  problem,  and,  in  light  of  the  nature 
of  economic  data,  we  do  not  search  for  near  optimal  membership  functions  but 
instead  determine  a  membership  function  that  seems  reasonable.  Fuzzy  rules  are 
discovered  using  an  evolutionary  search  procedure.  Fuzzy  rules  derived  by  the  genetic 
algorithm are applicable only to the pre-set membership functions but this is a minor
limitation.

Machine  learning  methods  have  been  applied  to  the  problem  of  mining  fuzzy  rules 
from  data.  For  example,  Hayashi  and  Imura  [2]  suggested  a  two-step  procedure  to 
extract  fuzzy  rules.  In  the  first  step,  a  neural  network  (NN)  was  trained  from  sample 
data.  In  the  second  step,  an  algorithm  was  used  to  automatically  extract  fuzzy  rules 
from  the  NN.  Kosko  [6]  interprets  fuzzy  rules  as  a  mapping  between  fuzzy 
membership  spaces  and  proposed  a  system,  called  Fuzzy  Cognitive  Maps,  to  integrate 
NN  and  fuzzy  systems.  Lin  and  Lee  [7]  proposed  a  NN-based  fuzzy  logic  system, 
which  consists  of  five  layers.  The  first  is  linguistic  input,  the  second  and  fourth  are 
terms  representing  a  membership  function,  the  third  is  a  set  of  rules  and  the  fifth  is  an 
output.  The  common  weaknesses  of  NN,  however,  are  the  lack  of  analytical  guidance, 
where  all  relationships  are  hidden  in  the  “black  box”  of  the  network  connections. 
Furthermore,  training  neural  networks  is  not  deterministic  and  the  learning  process 
may  be  trapped  in  local  solutions. 

Another widely used machine learning method is the induction of fuzzy
decision trees where fuzzy entropy is used to guide the search of the most effective
decision nodes [10], [11], [13]. Although in most situations decision tree
induction works well, it has some limitations. According to Yuan and Zhuang [14] the
one-step-ahead  node  splitting  without  backtracking  may  not  be  able  to  generate  the 
best  tree.  Another  limitation  is  that  even  the  best  tree  may  not  be  able  to  present  the 
best  set  of  rules  [12].  Furthermore,  this  method  has  been  found  to  be  sub-optimal  in 
certain  types  of  problems  such  as  multiplexer  problems  [8]. 

In  this  paper  we  use  an  evolutionary  approach  to  find  fuzzy  rules  from  macro- 
economic  data.  Genetic  algorithms  have  been  used  by  Rutkowska  [9]  to  find  near- 
optimal  fuzzy  rules  and  learn  the  shapes  of  membership  function.  Karr  [4]  also 
focussed  his  work  on  looking  for  a  high-performance  membership  function  using 
genetic  algorithms.  Yuan  and  Zhuang  [14]  discovered  fuzzy  rules  for  classification 
tasks that were most correct, complete and general.

In  our  work,  we  do  not  seek  rules  that  are  most  general,  complete  and  correct  but 
initially focus only on finding a complete list of rules that best describe the data. The
generalisation  of  rules  is  a  manual  process  exercised  if  required.  Often  with  systems 
as  complex  as  dynamic  economies  few  general  rules  are  non-trivial  and  more 
attention  is  focused  on  specific  rules.  Furthermore  in  order  to  find  the  most  general, 
complete  and  concise  rules  Yuan  and  Zhuang  [14]  proposed  definitions  of  these 
concepts.  The  adoption  of  similar  definitions  with  macro-economic  data  is  one  step 
toward  re-introducing  theoretical  assumptions  in  our  model  and  was  thus  avoided. 

In  the  next  section  we  describe  the  procedure  used  to  design  a  genetic  algorithm 
search  for  mining  fuzzy  rules. 


Description  of  method 

To  apply  the  genetic  algorithm  search  there  are  two  main  decisions  to  make: 

1.  How to code the possible solutions to the problem as finite bit strings, and

2.  How  to  evaluate  the  merit  of  each  string. 

Because  solutions  in  our  case  are  fuzzy  terms  in  the  fuzzy  rules  table,  we  construct 
the  solution  strings  as  rows  of  the  rules  table. 

Theoretically, it is possible to apply this coding for genetic search in any multilevel
fuzzy rules space. However, the length of the strings increases dramatically with the
number of inputs, outputs and fuzzy sets over which these inputs and outputs are
defined. We limited our examples to two inputs and one output defined over four
fuzzy sets.

Although binary coding is preferred by many researchers, we use integer numbers. A
binary gene can code only a variable that takes two values. To code variables that can
take more than two values in binary coding, we have to use several genes for each
variable and deal with unused coding patterns. To avoid this complexity and to cover
the possibility of more than two values appearing in each cell of the fuzzy rule table,
we used integer coding. We assign the numbers from 1 to N to the N fuzzy sets defined
for the output variable. Thus, each rule is represented by the corresponding number as
a gene in the coded chromosome.
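
As an illustration of this integer coding, a minimal Python sketch follows. The row-by-row
flattening of the rule table, the example table and the helper names `encode` and `decode`
are assumptions made for the example; the paper does not specify them.

```python
# Integer coding of a fuzzy rule table (illustrative sketch, not the authors' code).
# Assumption: the N = 4 output fuzzy sets are numbered 1..4 in this order.
FUZZY_SETS = ["NH", "NL", "PL", "PH"]


def encode(rule_table):
    """Flatten a table of output labels, row by row, into integers 1..N."""
    return [FUZZY_SETS.index(label) + 1 for row in rule_table for label in row]


def decode(chromosome, n_cols=4):
    """Rebuild the rule table (a list of rows) from an integer chromosome."""
    labels = [FUZZY_SETS[gene - 1] for gene in chromosome]
    return [labels[i:i + n_cols] for i in range(0, len(labels), n_cols)]


# A hypothetical 4x4 rule table, used only to exercise the helpers.
example_table = [
    ["NH", "NH", "NL", "NL"],
    ["NH", "NL", "NL", "PL"],
    ["NL", "PL", "PL", "PH"],
    ["PL", "PL", "PH", "PH"],
]
example_chromosome = encode(example_table)  # 16 genes, each in 1..4
```

With this coding every gene value is a valid fuzzy set, so no repair of unused bit patterns
is needed after crossover or mutation.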

The second task concerns the determination of the fitness function. Those
chromosomes that represent fuzzy rules more consistent with the data are considered
fitter than others. We calculate the sum of squared deviations between the output of
the fuzzy control with a given set of rules and the real value of the output indicated
in the data record. This value represents the fitness function value and is used as the
criterion in producing a new generation. In early trials we used the sum of absolute
differences instead of the sum of squared differences between actual and predicted
values to measure error, and obtained almost identical results. In order to be able to
compare our system’s performance with other techniques, we preferred to use the sum
of squares as the evaluation criterion.
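
The fitness computation can be sketched as follows. This is a simplified stand-in rather
than the authors' implementation: it assumes triangular membership functions with evenly
spaced peaks on [-5, 5], product inference, and a weighted average of output-set centres
in place of a full centre-of-gravity defuzzification. All names are hypothetical.

```python
import numpy as np

# Assumed peaks of the four fuzzy sets NH, NL, PL, PH on the interval [-5, 5].
CENTRES = np.array([-5.0, -5.0 / 3.0, 5.0 / 3.0, 5.0])
WIDTH = CENTRES[1] - CENTRES[0]


def memberships(x):
    """Degree of membership of a scalar x in each of the four fuzzy sets."""
    x = np.clip(x, CENTRES[0], CENTRES[-1])  # keep x inside the covered domain
    return np.clip(1.0 - np.abs(x - CENTRES) / WIDTH, 0.0, 1.0)


def predict(chromosome, x1, x2):
    """Output of the rule table coded by a 16-gene chromosome (rows: x1, columns: x2)."""
    rules = np.asarray(chromosome).reshape(4, 4) - 1  # 0-based output set per rule
    firing = np.outer(memberships(x1), memberships(x2))  # firing strength per rule
    return float(np.sum(firing * CENTRES[rules]) / np.sum(firing))


def fitness(chromosome, data):
    """Sum of squared deviations over (x1, x2, y) records; lower means fitter."""
    return sum((predict(chromosome, x1, x2) - y) ** 2 for x1, x2, y in data)
```

Because the value is an error, the chromosome with the smallest value ranks first.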

The crossover and mutation procedures are common for genetic algorithms and are as
follows. The current population is ranked according to the values of the fitness
function. The probability of each chromosome being chosen is proportional to its place
in the ranked list of the current population. Chromosomes are paired and either copied
directly into the new generation or used to produce a pair of children via a crossover
operation. The newly produced children are placed in the new generation. The
probability of two parents crossing over is set as a parameter of the algorithm. The
crossover procedure can have one or several crossover points. We break the parent
chromosomes at pre-determined places and mix the resulting parts to build a new child
chromosome.
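
A minimal sketch of rank-based selection and multi-point crossover is given below. It
assumes that the best chromosome (smallest error) receives the largest selection weight
and that children replace their parents pairwise; the function names and these details
are illustrative assumptions, not taken from the paper.

```python
import random


def rank_select(population, errors, rng):
    """Pick one chromosome with probability proportional to its rank (best = highest)."""
    # Sort indices from worst (largest error) to best (smallest error).
    order = sorted(range(len(population)), key=lambda i: errors[i], reverse=True)
    weights = list(range(1, len(population) + 1))  # worst gets 1, best gets len(population)
    return population[rng.choices(order, weights=weights, k=1)[0]]


def crossover(parent_a, parent_b, points):
    """Swap every second segment of two parents at the pre-determined cut points."""
    child_a, child_b = list(parent_a), list(parent_b)
    start, swap = 0, False
    for cut in list(points) + [len(parent_a)]:
        if swap:
            child_a[start:cut], child_b[start:cut] = child_b[start:cut], child_a[start:cut]
        start, swap = cut, not swap
    return child_a, child_b


def next_generation(population, errors, p_cross, points, rng):
    """Build a new population of the same size by selection and optional crossover."""
    new_population = []
    while len(new_population) < len(population):
        a = rank_select(population, errors, rng)
        b = rank_select(population, errors, rng)
        if rng.random() < p_cross:
            a, b = crossover(a, b, points)
        new_population.extend([list(a), list(b)])
    return new_population[:len(population)]
```

For example, `crossover([1, 1, 1, 1], [2, 2, 2, 2], [2])` yields the children
`[1, 1, 2, 2]` and `[2, 2, 1, 1]`.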

The mutation process is applied to each gene of each chromosome in all generations.
The integer number in a gene randomly increases or decreases by one. This allows new
gene values to appear at a given place in the chromosome, whilst avoiding huge changes
in the original solution pattern, so that the solution moves toward a mutant in its
neighbourhood.
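
The mutation step can be sketched as below. Clamping a mutated gene to the valid range
1..N is an assumption of this example; the paper does not say how values at the ends of
the range are handled.

```python
import random


def mutate(chromosome, rate, n_sets, rng):
    """With probability `rate` per gene, move the gene up or down by one."""
    mutant = []
    for gene in chromosome:
        if rng.random() < rate:
            gene += rng.choice((-1, 1))
            gene = min(max(gene, 1), n_sets)  # assumed: clamp to the valid range 1..n_sets
        mutant.append(gene)
    return mutant
```

A rate of 0.001 corresponds to a mutation rate of 0.1% per gene.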

The  following  section  presents  an  implementation  and  tests  the  described  method 
with  data  generated  from  known  rules. 


Example  with  generated  data 

In  order  to  test  our  method  we  ran  the  system  over  data  generated  artificially.  By 
defining  the  functional  dependence  between  input  and  output  variables  we  know 
exactly  what  the  derived  fuzzy  rules  should  be. 

Two inputs and one output were used to test the system. The same four fuzzy sets,
{Negative High (NH), Negative Low (NL), Positive Low (PL) and Positive High (PH)},
were defined over all of the system’s variables as shown in Figure 1. Thus, possible
solutions representing the 4x4 fuzzy rule table were coded into chromosomes 16 genes
long. The most popular defuzzification method, centre of gravity, was chosen. The task
was to find all rules of the form {If x1 is FuzzySet(i) AND x2 is FuzzySet(j), THEN y is
FuzzySet(k)}, where i, j, k = 1..4 and FuzzySet(i), FuzzySet(j) and FuzzySet(k) belong
to the set {Negative High, Negative Low, Positive Low and Positive High}.

Fig. 1. Fuzzy sets for x1, x2 and y

Artificial data was constructed as follows. One hundred random values were generated
from the interval (-5, 5) for each of the variables x1 and x2. We then put them through
a pre-defined function, y = (10*x1 - x2)/11, and stored the output. The function
y = (10*x1 - x2)/11 was chosen arbitrarily, only to demonstrate the method’s
performance. We ran the system with a crossover probability of 40%, a mutation rate of
0.1% and a population size of 50. The genetic algorithm was run 50 times with
different initial populations. In 100% of these test simulations the search converged to
the solution presented in Table 2 after approximately 50 generations.
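
The construction of the artificial data set can be reproduced with a few lines of NumPy.
The uniform distribution and the random seed are assumptions of this sketch; the paper
only states that the values were random and drawn from (-5, 5).

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # arbitrary seed, chosen only for repeatability
x1 = rng.uniform(-5.0, 5.0, size=100)  # assumed uniform on the interval
x2 = rng.uniform(-5.0, 5.0, size=100)
y = (10.0 * x1 - x2) / 11.0  # the pre-defined function from the text

# One record per row: (x1, x2, y)
data = np.column_stack([x1, x2, y])
```

Since the coefficients sum to one in absolute value, every output also lies in (-5, 5), so
the same four fuzzy sets can cover x1, x2 and y.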

Table 2 illustrates that the search algorithm finds a “very good” solution for data
generated by the function y = (10*x1 - x2)/11. As expected, the order of fuzzy outputs
for y in the fuzzy rule table decreases with an increase in x2 and increases with an
increase in x1. This is consistent with the positive coefficient on variable x1 and the
negative coefficient on variable x2. Moreover, the value of x1 is more significant in
determining the output y, as we would expect given that the coefficient of x1 is 10
times larger than that of x2 in the function. This can be observed in the first and the
fourth rows, where the values for y are Negative High and Positive High respectively
regardless of the values of x2. The rest of the cells also confirm that positive values of
x1 are more prominent in determining the value of y than negative values of x2, and
vice versa.


x1 \ x2    NH    NL    PL    PH
NH         NH    NH    NH    NH
NL         NL    NL    ?     ?
PL         PL    PL    PL    NL
PH         PH    PH    PH    PH

Table  2.  Fuzzy  rules  for  the  dummy  data 

The  next  section  describes  an  example  of  applying  the  algorithm  to  real  world 
economic  data. 


Example with economic data looking for theoretical assumptions,
e.g. Keynesian theory

In this section, in order to test the algorithm with real economic data, we chose
economic indicators with well-known interrelationships. The Keynesian General
Theory is based on the fundamental assumption that the level of national income
determines the level of consumption [5]. In his famous multiplier model, Keynes
introduces an increasing function C = f(Y), where C is consumption and Y is national
income. This hypothesis has been tested quite successfully in many developed countries.

According to classical economic theory, interest rates affect the level of consumption.
Classical theorists provide the following reasoning to support this hypothesis. If the
level of interest rates rises, then people expect to earn more money in the future on
each dollar saved in the present. More people will therefore prefer not to spend today,
but to wait for a future time when they will have more to spend. For a given level of
Production Output or National Income, more savings mean less consumption.

In our study we expect to find evidence for the well-known associations depicted by
both Keynesian and Classical theories. Economic data describing the dynamics of these
indicators in the United States were obtained from the Federal Reserve Bank of
St. Louis. The records were collected on a regular basis from 1960 to 1997.

We compared our fuzzy rule generation method with linear regression and a
feed-forward neural network on the same Federal Reserve Bank data.

The data were transformed from actual quarterly values of consumption and national
income into changes in those values over a quarter. In total, 150 records representing
the change from one quarter to the next were collected. These data allowed us to make
our system more sensitive to changes in the modelled economic indicators.
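
The transformation from quarterly levels to quarter-on-quarter changes is a first
difference. A minimal sketch follows; the function name and the sample numbers are
hypothetical and are not Federal Reserve data.

```python
import numpy as np


def quarterly_changes(levels):
    """Return the change from each quarter to the next for a series of levels."""
    return np.diff(np.asarray(levels, dtype=float))


# Hypothetical levels for four consecutive quarters give three changes.
example_changes = quarterly_changes([100.0, 103.0, 101.0, 106.0])
```

A series of 151 quarterly levels produces the 150 change records used as inputs and output.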

The first input is the change over a quarter in the level of national income. The second
input is the change in the interest rate over a quarter. The output is the change in the
level of real personal consumption over a quarter. The intervals of real values of the
inputs and the output were set from the minimum and maximum observed changes in the
corresponding variables. The four fuzzy sets, {Negative High (NH), Negative Low (NL),
Positive Low (PL) and Positive High (PH)}, were set in the manner illustrated in
Figure 1. The choice of fuzzy sets is supported by the importance in economic modelling
of distinguishing between an increase and a decrease in the control variable, which is
reflected in the negative or positive direction of the changes. Furthermore, it is
valuable to distinguish between different degrees of change; therefore high and low
fuzzy sets are distributed over both the positive and negative sides of the variables’
domains.

Ten-fold cross-validation was used with hold-out sets of size 15 and training sets of
size 135. For each cross-validation set, fuzzy rules were generated as described above.
The sum of squared differences between the consumption predicted by the fuzzy rules
and the actual consumption on the test set was recorded. This was repeated with a
simple linear regression model and also with a feed-forward neural network trained with
back-propagation of errors (3 layers, learning rate = 0.2, no improvement in error rates
after 40-55 epochs). Table 3 gives the mean, standard deviation and median of the sum
of squared differences between the predicted and actual change in consumption for the
fuzzy rules, neural network and linear regression over the ten cross-validation sets.
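
The evaluation protocol can be sketched for the linear regression baseline as follows.
The random assignment of records to folds and the least-squares fit with an intercept are
assumptions of this example; the paper does not describe how the folds were formed.

```python
import numpy as np


def cross_validated_sse(X, y, n_folds=10, seed=0):
    """Hold-out sum of squared errors of a linear regression for each fold."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)  # assumed random folds
    sse = []
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(len(y)), test_idx)
        # Fit y = c0 + c1*x1 + c2*x2 on the training records only.
        A_train = np.column_stack([np.ones(len(train_idx)), X[train_idx]])
        coef, *_ = np.linalg.lstsq(A_train, y[train_idx], rcond=None)
        A_test = np.column_stack([np.ones(len(test_idx)), X[test_idx]])
        residual = y[test_idx] - A_test @ coef
        sse.append(float(residual @ residual))
    return np.array(sse)
```

With 150 records and ten folds each hold-out set has 15 records, and the mean, standard
deviation and median of the returned array correspond to one column of the comparison.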
                  Fuzzy rules    Neural network    Linear regression
Mean              14.75          17.31             23.25
Std. Deviation    5.56           10.42             ?
Median            13.59          15.4              21.98

Table  3.  Comparison  of  fuzzy  rules,  neural  network  and  linear  regression 

The  fuzzy  rules  generated  by  the  genetic  algorithm  method  proposed  here 
performed  very  well  in  comparison  to  a  linear  regression  model.  This  was  perhaps  to 
be  expected  because  the  relationship  between  changes  in  national  income,  interest 
rates  and  consumption  is  expected  to  be  more  complex  than  a  simple  linear  one. 
Neural  networks  can  capture  non-linear  relationships  and  the  networks  trained 
performed  better  than  the  linear  regression  models.  However,  the  performance  of  the 
fuzzy  rules  was  comparable  to  the  trained  networks. 

The  table  of  rules  is  included  in  Table  4  where  Y  is  change  in  national  income  and 
I  is  change  in  interest  rates.  The  fuzzy  rules  predict  change  in  consumption. 


Y \ I    NH    NL    PL    PH
NH       PH    PL    NH    NL
NL       NH    NL    NL    NL
PL       NL    PL    PL    NL
PH       PH    PH    PH    NL

Table  4.  An  optimal  set  of  fuzzy  rules  for  the  data  of  Example  2. 


The black box nature of neural networks is a distinct disadvantage for the analysis of
macro-economic data. In contrast, as Table 4 illustrates, fuzzy rules generated without
any theoretical assumptions can be used to explore patterns and even to assess the
veracity of theories. To perform this assessment, let us summarise the search results in
light of both theories. Firstly, taking into account that both types of economists
usually assume consumption dependencies close to linear, we can approximately define
them in rule form as shown in Table 5. Table 6 can then be interpreted as showing to
what degree the rules confirm either or both theories.


I     C          Y     C
NH    PH         NH    NH
NL    PL         NL    NL
PL    NL         PL    PL
PH    NH         PH    PH

Table 5. Classical and Keynesian consumption dependencies

Y \ I    NH           NL           PL           PH
NH       Classical    Classical    Cl. & Kn.    Cl. & Kn.
NL       Keynesian    Keynesian    Cl. & Kn.    Cl. & Kn.
PL       -            Cl. & Kn.    Keynesian    Classical
PH       Cl. & Kn.    Cl. & Kn.    Keynesian    Classical

Table 6. Rules interpretation

The selected part of Table 6 covers the areas where the two theories do not contradict
each other and both theories are confirmed by the rules. In fact, according to Table 5,
when interest rates rise and national income falls, consumption shrinks and, in the
opposite corner, when interest rates fall and national income rises, consumption rises.
The (Y-PL, I-NH) cell is the only exception to the theories’ predictions in these areas.

In the rest of the table we can observe that high rises in interest rates make
consumption behave classically, while under low rises in interest rates it conforms to
Keynesian theory. Regarding national income, under a high decrease in national income
consumption reacts in a classical manner, while under low decreases in national income
consumption is determined in a Keynesian way.
Conclusions


In  this  paper  we  demonstrated  an  application  of  fuzzy  logic  to  macro-economic 
modelling.  Despite  benefits,  fuzzy  logic  has  not  been  used  as  widely  as  mathematical 
and  statistical  techniques  for  this  purpose.  Our  use  of  fuzzy  logic  makes  none  of  the 
theoretical  assumptions  normally  made  in  macroeconomics  and  is  more  intuitive.  We 
elicited  fuzzy  rules  directly  from  macro-economic  data  using  a  genetic  algorithm 
search  for  rules  that  best  fit  the  data.  The  technique  was  evaluated  initially  with 
artificially  generated  data  and  then  with  data  from  the  US  economy.  Fuzzy  rules  were 
successfully  discovered  that  described  the  function  used  to  generate  the  artificial  data. 
Furthermore, fuzzy rules generated from real economic data provided a fine-grained
analysis of economic activity and were used to explore the relationship between two
diverse economic theories. The fuzzy rules generated by this approach were compared
for predictive accuracy with a linear regression model and with a neural network. The
fuzzy rules out-performed both approaches on mean and median squared error.
Abstract. In most of the adaptive fuzzy control schemes presented so far, only the
parameters that appear linearly in the radial basis function (RBF) expansion (the
weights of each rule’s consequent) are tuned. The major disadvantage is that the
precision of the parameterized fuzzy approximator cannot be guaranteed, and
consequently the control performance is affected. In this paper, we tune not only the
weighting parameters but also the variances, which appear nonlinearly in the RBF, to
reduce the approximation error and improve control performance, using a lemma by
Annaswamy et al. (1998) known as concave/convex parameterization. Global
boundedness of the overall adaptive system and tracking to within a desired precision
are established with the proposed adaptive controller.


1.  Introduction 

The application of fuzzy set theory to control problems has been the focus of numerous
studies. The motivation is that fuzzy set theory provides an alternative to the
traditional modeling and design of control systems when system knowledge and dynamic
models in the traditional sense are uncertain and time varying. In spite of many
successes, fuzzy control has not been viewed as a rigorous approach due to the lack of
formal synthesis techniques that guarantee the basic requirements for a control system,
such as global stability. Recently, various adaptive fuzzy control schemes have been
proposed to deal with nonlinear systems with poorly understood dynamics by using the
parameterized fuzzy approximator [1-3]. However, in most of the schemes presented so
far, only the parameters that appear linearly in the radial basis function (RBF)
expansion (the weights of each rule’s consequent) are tuned. The major disadvantage is
that the precision of the parameterized fuzzy approximator cannot be guaranteed, and
therefore the control performance may be affected. In the RBF expansion, three
parameter vectors are used: the connection weights, the variances and the centers. Very
few results are available for the adjustment of nonlinearly parameterized systems.
Though gradient approaches have been used [4-6], the way of fusing them into adaptive
fuzzy control schemes so as to guarantee global stability is still an open question. The
desirable approach would apparently be to tune the three parameter vectors
simultaneously. However, this leads to complicated algorithms and a high computational
cost. Since the RBF expansion is just a kind of approximator, we can refer to neural
networks, which are known to have excellent approximation ability. In a neural network,
it is generally sufficient to tune the weights and variances to improve the precision of
the approximation, whereas the centers are simply placed on a regular mesh covering a
relevant region of the system space. In this paper, using a lemma by Annaswamy et al.
known as concave/convex parameterization [7], we tune not only the weighting
parameters but also the variances, which appear nonlinearly in the RBF, to reduce the
approximation error and improve control performance. Global boundedness of the overall
adaptive system and tracking to within a desired precision are established with the
proposed adaptive controller.

2.  Problem  Statement 

This paper focuses on the design of adaptive control algorithms for a class of dynamic
systems whose equation of motion can be expressed in the canonical form

$x^{(n)}(t) + f(x(t), \dot{x}(t), \ldots, x^{(n-1)}(t)) = b\,u(t)$  (1)

where $u(t)$ is the control input, $f(\cdot)$ is an unknown linear or nonlinear function
and $b$ is the control gain. It should be noted that more general classes of nonlinear
control problems can be transformed into this structure [8].

The control objective is to force the state $X(t) = [x(t), \dot{x}(t), \ldots, x^{(n-1)}(t)]^T$
to follow a specified desired trajectory
$X_d(t) = [x_d(t), \dot{x}_d(t), \ldots, x_d^{(n-1)}(t)]^T$. Defining the tracking error
vector $\tilde{X}(t) = X(t) - X_d(t)$, the problem is thus to design a control law $u(t)$
which ensures that $\tilde{X}(t) \to 0$ as $t \to \infty$. For simplicity in this initial
discussion, we take $b = 1$ in the subsequent development.

3.  Fuzzy  System 

Assume that there are N rules in the fuzzy system considered, each of which has the
following form:

$R_j$: IF $x_1$ is $A_1^j$ and $x_2$ is $A_2^j$ and ... and $x_n$ is $A_n^j$, THEN $z$ is $B^j$

where $j = 1, 2, \ldots, N$; $x_i$ ($i = 1, 2, \ldots, n$) are the input variables to the
fuzzy system, $z$ is the output variable of the fuzzy system, and $A_i^j$ and $B^j$ are
linguistic terms characterised by fuzzy membership functions $\mu_{A_i^j}(x_i)$ and
$\mu_{B^j}(z)$, respectively. As in [2], we consider a subset of the fuzzy systems with
singleton fuzzifier, product inference, and Gaussian membership functions. Hence, such a
fuzzy system can be written as
$h(X) = \dfrac{\sum_{j=1}^{N} \omega_j \left( \prod_{i=1}^{n} \mu_{A_i^j}(x_i) \right)}{\sum_{j=1}^{N} \left( \prod_{i=1}^{n} \mu_{A_i^j}(x_i) \right)}$  (2)

where $h: U \subset R^n \to R$, $X = (x_1, x_2, \ldots, x_n) \in U$; $\omega_j$ is the point
in $R$ at which $\mu_{B^j}(\omega_j) = 1$, named the connection weight; $\mu_{A_i^j}(x_i)$
is the Gaussian membership function, defined by

$\mu_{A_i^j}(x_i) = \exp[-(\sigma_i^j (x_i - \xi_i^j))^2]$  (3)

where $\xi_i^j$, $\sigma_i^j$ are real-valued parameters. Contrary to the traditional
notation, in this paper we use $1/\sigma_i^j$ to represent the variance, for the
convenience of later development.
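
To make the structure of such a fuzzy system concrete, the following sketch evaluates it
for given parameters. It assumes that the rule outputs are combined as a normalised
weighted sum (centre-average defuzzification), which is the usual companion of a singleton
fuzzifier and product inference but is an assumption here; the function and argument names
are illustrative.

```python
import numpy as np


def fuzzy_system(x, omega, sigma, xi):
    """Evaluate a fuzzy system with N rules and n inputs.

    x: inputs, shape (n,); omega: connection weights, shape (N,);
    sigma: inverse-width parameters, shape (N, n); xi: centres, shape (N, n).
    """
    mu = np.exp(-(sigma * (x - xi)) ** 2)  # Gaussian memberships, shape (N, n)
    firing = np.prod(mu, axis=1)           # product inference: one strength per rule
    return float(firing @ omega / np.sum(firing))  # assumed normalised weighted sum


# Two rules, one input: the input is equidistant from both centres,
# so the output is the mean of the two weights.
example_output = fuzzy_system(
    x=np.array([0.0]),
    omega=np.array([2.0, 4.0]),
    sigma=np.array([[1.0], [1.0]]),
    xi=np.array([[-1.0], [1.0]]),
)
```

The weights enter the output linearly, whereas `sigma` enters through the exponential,
which is why tuning the variances is a nonlinearly parameterized problem.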

Definition  1:  Define  fuzzy  basis  functions  (FBF’s)  as 

y  =  1.2,-,Af  (4) 

where  are  the  Gaussian  membership  functions  defined  in  (3), 

and  Oj  .  Then,  the  fuzzy  system  (2)  is  equivalent 

to  a  FBF  expansion 

(5) 

Remark: It is obvious that $g_j(\cdot)$ is convex and $-g_j(\cdot)$ is concave with
respect to $\sigma_j$. The definitions of concave and convex functions can be found in [7].

Theorem 1: For any given real continuous function $f$ on the compact set $U \subset R^n$
and arbitrary $\varepsilon > 0$, there exists an optimal FBF expansion $h^*(X) \in A$ such
that

$\sup_{X \in U} |f(X) - h^*(X)| < \varepsilon$  (6)

This theorem states that the FBF expansion (5) is a universal approximator on a compact
set. Since the fuzzy universal approximator is characterized by the parameters
$\omega_j$, $\sigma_i^j$ and $\xi_i^j$, the optimal $h^*(X)$ contains the optimal
parameters $\omega_j^*$, $\sigma_i^{j*}$ and $\xi_i^{j*}$.

Without doubt, the desirable approach is to tune the three parameter vectors
simultaneously. However, this definitely leads to complicated algorithms and a high
computational cost. Since in this paper the FBF expansion is just a kind of approximator,
we can refer to neural networks, which are known to have excellent approximation
ability. In a neural network, it is generally sufficient to tune the weights and
variances to improve the precision of the approximation, whereas the centers are simply
placed on a regular mesh covering a relevant region of the system space.

To guarantee the stability of the proposed adaptive fuzzy system, we reduce the
algorithm to a min-max problem in LP. Though solving the min-max problem is an ordinary
problem in LP and there are many approaches [9-10], most of them involve a complicated
procedure. In this paper, we use a lemma by Annaswamy et al., known as concave/convex
parameterization, to develop the adaptive fuzzy control system.


4.  A  Solution  of  Min-Max  Problem 

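Let us consider a scalar function $f(\phi(t), \theta)$, which is continuous and bounded
with respect to its arguments. $\theta$ is an unknown parameter vector and belongs to a
known hypercube $\Theta \subset R^n$, $\phi(t)$ is a known bounded function of $X$, and
for any $\phi(t)$, $f(\phi(t), \theta)$ is either convex or concave on $\Theta_s$, where
$\Theta_s$ is a simplex in $R^n$ such that $\Theta_s \supset \Theta$. Suppose that the
vertices of $\Theta_s$ are denoted as $\theta_{si}$ ($i = 1, 2, \ldots, n+1$). Then
$\Theta_s$ may be expressed as

$\Theta_s = \left\{ \theta \;\middle|\; \theta = \sum_{i=1}^{n+1} \alpha_i \theta_{si},\ \sum_{i=1}^{n+1} \alpha_i = 1;\ 0 \le \alpha_i \le 1 \right\}$
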


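Theorem 2: Let $a^*$ and $\omega^*$ be the solutions of the following min-max
optimizations:

$a^* = \min_{\omega \in R^n} \max_{\theta \in \Theta_s} J(\omega, \theta)$

$\omega^* = \arg \min_{\omega \in R^n} \max_{\theta \in \Theta_s} J(\omega, \theta)$

$J(\omega, \theta) = \beta \left[ f(\phi, \theta) - f(\phi, \hat{\theta}) + \omega^T (\hat{\theta} - \theta) \right]$

where $\hat{\theta} \in \Theta_s$ and $\beta$ is a known non-zero constant. Then

$a^* = \begin{cases} A_1 & \text{if } \beta f \text{ is convex on } \Theta_s \\ 0 & \text{if } \beta f \text{ is concave on } \Theta_s \end{cases}$

$\omega^* = \begin{cases} A_2 & \text{if } \beta f \text{ is convex on } \Theta_s \\ \nabla f_{\hat{\theta}} & \text{if } \beta f \text{ is concave on } \Theta_s \end{cases}$

where $\nabla f_{\hat{\theta}} = \partial f / \partial \theta \,|_{\hat{\theta}}$,
$A = [A_1, A_2^T]^T = G^{-1} b$, $A_1$ is a scalar, $A_2 \in R^n$,

$G = \begin{bmatrix} -1 & \beta(\hat{\theta} - \theta_{s1})^T \\ -1 & \beta(\hat{\theta} - \theta_{s2})^T \\ \vdots & \vdots \\ -1 & \beta(\hat{\theta} - \theta_{s,n+1})^T \end{bmatrix}$, 
$b = \begin{bmatrix} \beta(\hat{f} - f_{s1}) \\ \beta(\hat{f} - f_{s2}) \\ \vdots \\ \beta(\hat{f} - f_{s,n+1}) \end{bmatrix}$

$\hat{f} = f(\phi, \hat{\theta})$ and $f_{si} = f(\phi, \theta_{si})$.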
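
For a scalar parameter (n = 1) and a convex function, the min-max value of Theorem 2 can
be checked numerically. The sketch below assumes the cost
J(w, theta) = beta * [f(theta) - f(theta_hat) + w * (theta_hat - theta)] and uses the fact
that, for convex beta*f, the maximum over an interval is attained at an end point, so the
optimum equalises J at the two vertices; this gives a 2x2 linear system. The test function,
interval and grid sizes are arbitrary choices made for the example.

```python
import numpy as np


def minmax_convex(f, theta_hat, vertices, beta=1.0):
    """Closed-form solution for n = 1: solve G [a, w]^T = b at the two vertices."""
    f_hat = f(theta_hat)
    G = np.array([[-1.0, beta * (theta_hat - v)] for v in vertices])
    b = np.array([beta * (f_hat - f(v)) for v in vertices])
    a_star, w_star = np.linalg.solve(G, b)
    return a_star, w_star


def f(theta):
    """Arbitrary convex test function."""
    return theta ** 2


theta_hat = 0.5
a_star, w_star = minmax_convex(f, theta_hat, vertices=(0.0, 2.0))

# Brute-force check: evaluate J on a grid of (w, theta) and take min over w of max over theta.
thetas = np.linspace(0.0, 2.0, 401)
ws = np.linspace(-5.0, 5.0, 1001)
J = (f(thetas)[None, :] - f(theta_hat)) + ws[:, None] * (theta_hat - thetas[None, :])
a_brute = J.max(axis=1).min()
```

The closed-form value and the grid search agree to within the grid resolution, which
illustrates why the simplex vertices make a numerical search unnecessary.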


Remarks: 

1)  The solution to such an LP problem will generally involve numerical searches over
the feasible set of solutions. The above theorem introduces the simplex $\Theta_s$
precisely to avoid such a search.
2)  To decide the solutions in (11), Theorem 2 requires that the convexity or concavity
of $\beta f$ be known. Moreover, $\beta$ must be a known non-zero constant. This is a
strict restriction on applying the theorem in some applications. Even when the
convexity or concavity of the function $f$ under discussion is known, the sign of
$\beta$ is generally unknown. To deal with this problem, we introduce the concept of a
one-to-one mapping in the next section.

We are now ready to develop the adaptive fuzzy control system in which the parameters
$\omega_j$ and $\sigma_j$ are tuned and the stability of the system is ensured.

5.  Adaptive  Fuzzy  Control  System 

Firstly  applying  the  Theorem  1,  unknown  function  in  (1)  can  be  approximated  by  a 
fuzzy  approximator  f\X), 

Where  oo]  GR  and  a]  -  are  optimal  parameters  for  unknown  function 

f(X)  in  (1).  It  is  reasonable  to  suppose  that  it  exists  a  known  constant  >  0 ,  so  that 

the  approximation  enor  s  ,  defined  as  in  (18),  satisfies  that  |f  |  s  . 

emf(X)-f(X)  (14) 

Normally,  the  unknown  parameters  values  to*  and  o]  are  replaced  by  their  estimates 

and  ,  and  the  estimate  function  /(^)  - 

instead  of  f\X),  is  used  to  approximate  the  unknown  function  f{X) .  The  parameter 

in  the  estimate  function  f{X)  should  then  be  stably  tuned  to  provide  effective  tracking 

control  architecture.  Define  the  estimation  errors  of  the  parameters  as 

a^McrJ-d^  (15) 

As in [2], an error metric is defined as

$s(t) = \left( \dfrac{d}{dt} + \lambda \right)^{n-1} \tilde{x}(t)$ with $\lambda > 0$  (16)

which can be rewritten as $s(t) = \Lambda^T \tilde{X}(t)$ with
$\Lambda^T = [\lambda^{n-1}, (n-1)\lambda^{n-2}, \ldots, 1]$. The equation $s(t) = 0$
defines a time-varying hyperplane in $R^n$ on which the tracking error vector decays
exponentially to zero, so that perfect tracking can be asymptotically obtained by
maintaining this condition. In this case the control objective becomes the design of a
controller to force $s(t) = 0$. The time derivative of the error metric can be written as

$\dot{s}(t) = -x_d^{(n)}(t) + \Lambda_v^T \tilde{X}(t) + u(t) - f(X)$  (17)

where $\Lambda_v^T = [0, \lambda^{n-1}, (n-1)\lambda^{n-2}, \ldots, (n-1)\lambda]$.
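
As a small illustration of the error metric, the weights multiplying the tracking error
and its derivatives are the binomial coefficients of the operator (d/dt + lambda)^(n-1).
The sketch below computes them; the function names are hypothetical.

```python
from math import comb


def error_metric_weights(n, lam):
    """Weights on (x~, x~', ..., x~^(n-1)) in s(t) = (d/dt + lam)^(n-1) x~(t)."""
    return [comb(n - 1, k) * lam ** (n - 1 - k) for k in range(n)]


def error_metric(x_tilde, lam):
    """Evaluate s for a tracking error vector x_tilde = [x~, x~', ..., x~^(n-1)]."""
    weights = error_metric_weights(len(x_tilde), lam)
    return sum(w * e for w, e in zip(weights, x_tilde))
```

For a second-order system (n = 2) this reduces to s = x~' + lambda * x~, so on the surface
s = 0 the tracking error obeys x~' = -lambda * x~ and decays exponentially.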

Our  adaptive  control  law  is  now  described  below: 

u(t)  -  -kAt)  *4\t)  -  l^vm  +  /(AT)  -  (o'  +  £,  )sgn(i) 
(U(0  =  -A;i(0g;(aj) 


°i  --PiX^'AO 


(18) 

(19) 

(20) 


where  A^.  and  are  rates  of  adaptation,  a’ Gi?  and 

global  stability  and  be  clear  as  follows. 

Consider  Liapunov  function  candidate 


Time  derivative  of  V  is  given, 


will  lead  to 


(21) 


F(/)=  -kA  (0  -  e  s  (0  -  */  K0|  -  ■'(0  ^  (“J  (sy  (<^/')  -  S  J  (<^;))+  -  O'] ))+  «■  sgn(s) 


(22) 


We now consider three cases, (a) $s(t) = 0$, (b) $s(t) < 0$ and (c) $s(t) > 0$, and show
that $\dot{V}(t) \le 0$ in the three cases.

(a) $s(t) = 0$.

It is clear that $\dot{V}(t) = 0$.

(b) $s(t) < 0$.

It follows that $\dot{V}(t) \le 0$ if

(23) 

Therefore,  we  choose 

0  =  (24) 


Since the form of the controller in equation (21) suggests that the quantity $a^*$ acts
like a gain, we seek a $\kappa_j$ such that $a^*$ is minimized. Hence our goal is to
choose $\kappa_j$ as

o'  (25) 


Performing the min-max optimization to find $a^*$ and $\kappa_j^*$ is needed to complete
the controller design. As an LP problem, there are many algorithms [9-10] for solving
for the values of $a^*$ and $\kappa_j^*$; however, they generally involve a numerical
search over the feasible set of solutions with a complicated procedure. Hence, we use
Theorem 2 to solve for the values of $a^*$ and $\kappa_j^*$ in (25). At first glance,
Theorem 2 appears to provide a way to solve for these values; however, in order to
determine whether the function $\omega_j^* g_j(\sigma_j)$ is convex or concave, the sign
of $\omega_j^*$ must be known. Namely, because $g_j(\sigma_j)$ is convex on $\Theta$,
$\omega_j^* g_j(\sigma_j)$ will be convex when $\omega_j^* \ge 0$, and concave when
$\omega_j^* < 0$. Hence it is obvious that Theorem 2 cannot be applied directly, since it
does not give us any clue on how to determine the sign of $\omega_j^*$.

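Since $\omega_j^*$ is the optimal weight in (6), it is reasonable to assume that the
range of $\omega_j^*$ is known, i.e., $\omega_j^* \in [\omega_{\min}, \omega_{\max}]$.
Now, we set up a new parameter $\rho_j \in [\rho_{\min}, \rho_{\max}]$, where the
boundaries $\rho_{\min}$ and $\rho_{\max}$ are positive constants that can be chosen by
the designer. To deal with the problem of the sign of $\omega_j^*$, we introduce the
following one-to-one mapping:

$\omega_j^* = m + n\rho_j$;  $\rho_j \in [\rho_{\min}, \rho_{\max}]$

where

$m = \dfrac{\omega_{\min}\rho_{\max} - \omega_{\max}\rho_{\min}}{\rho_{\max} - \rho_{\min}}$,
$n = \dfrac{\omega_{\max} - \omega_{\min}}{\rho_{\max} - \rho_{\min}}$  $(> 0)$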
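
The one-to-one mapping can be illustrated with a short sketch. It assumes the mapping is
the affine map that sends the interval [rho_min, rho_max] onto [w_min, w_max]; the function
name and the numerical bounds are hypothetical.

```python
def make_mapping(w_min, w_max, rho_min, rho_max):
    """Return (m, n) such that w = m + n * rho maps [rho_min, rho_max] onto [w_min, w_max]."""
    n = (w_max - w_min) / (rho_max - rho_min)  # positive slope when w_max > w_min
    m = w_min - n * rho_min
    return m, n


# Hypothetical bounds: the weight may be negative, the new parameter is strictly positive.
m, n = make_mapping(w_min=-2.0, w_max=3.0, rho_min=1.0, rho_max=2.0)
w_at_rho_min = m + n * 1.0
w_at_rho_max = m + n * 2.0
```

Since w = w_min + n * (rho - rho_min) with n * (rho - rho_min) >= 0, the weight splits into
a term of known sign and a non-negative term, which is what allows the convexity of each
part to be decided.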

Substituting  (26-27)  to  (25),  we  have 

a '  =  mn  ^  ^  +  nP;  ](g,  i )  -  S;  (<^ j ))+  *■;  (<J /  -  ))  =  “i  +  "i 


(26) 

(27) 

(28) 


where 

a\  -  min  maxy  ^(p,  -Prt.)(g;(c’j)-?;(o,))+K'2"i(<^j 

^2^6^  I7G8, 

The  either  convex  or  concave  of  o)‘„,ngj(^j)  and  n(pj -p^^Jg jCa^)  could  be 
determined  in  both  (29)  and  (30)  via  the  one-to-one  mapping,  since  the  sign  of  w* is 
known  by  the  assumption  and  n(pj  >  0  . 


Now  applying  the  Theorem  2  straightforwardly,  we  can  get  the  solutions  of  a*  , 
a] ,  kIj  and  as  follows, 

“‘"(o  (31) 

where  4  -  [Aj„A,^]  -  ,  4,  is  a  scalar,  4^  £/?*" , 

<^*min(8}-8}sl) 

^mln(.8  j  ""  8  jsl') 

^min  ^8  j  8  jsNn+1 ) 


-1  <wli„(^/  -^JslY 
-1  -^JslY 

^min(^;  ~^JsNn*\) 
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-  ^5;,  ,  k'^,.  -  «{pj  )b^j 

where  ^ji  isascalar,  Bj^SR^", 


-1  n{pj-p„^\aj-ai,y 

■«(pj  -P™n)(i,  -g/,l) 

G,  - 

-1 

"ipj  -p„ii,)(i/-g/.2) 

-1  f^ipi  “ Pmliiy^J  ~^JsNn-*l) 

^iP]  ~  Pttdn\sj  ~gjsNn*l) 

(c)  s(t)  > 

0. 

By an argument similar to (b), it follows that $\dot{V}(t) \le 0$ if

where 


“^;))] 

“'2  "*;^.^^K'’(P/ -Prnlnhl(.^’l)-Sl<Pt))*>‘‘2I^J  -‘^/))] 

We  can  get  the  solutions  of  a,' ,  a[ ,  and  x-j^.  as  follows, 


a,  -  mm  max 


4i  ^  ® 


'‘“{-‘oLA;, 

where  -[4^,  ,,4^2}^  is  a  scalar,  Aj^^R"" , 


fo'mygje,  a  0 


-1 

-1 

”1 


~^min(g/  S  Isl^ 


a^-O,  =-«(p;-P„„)^g« 


(33) 

(34) 

(35) 

(36) 

(37) 

(38) 

(39) 

(40) 


The  stability  of  the  closed-loop  system  described  by  (1),  (18-20),  (28),  (31-35)  and 
(38-40)  is  established  in  the  following  theorem. 
Theorem 3: If the robust adaptive control (18-20), (28), (31-35) and (38-40) is applied
to the nonlinear plant (1), then all closed-loop signals are bounded and ^(0 → 0 as
t → ∞.

6.  Conclusion 

The novel feature of the results in this paper is a new adaptive fuzzy control law whose
design procedure is simplified by the min-max solutions of Annaswamy et al. The
adaptive fuzzy control law is capable of stably tuning the parameters which appear
nonlinearly in the fuzzy approximator in an effort to reduce the approximation error
and improve control performance. The developed controller guarantees the global
stability of the resulting closed-loop system in the sense that all signals involved are
uniformly bounded and tracking is achieved to within a desired precision. As future
work, we will verify our theoretical analysis by computer simulation.
Abstract. This paper presents a new approach to controlling chaotic
systems using fuzzy regulators. The relaxed stability conditions and LMI
(Linear Matrix Inequality) based designs for a fuzzy regulator are used
to construct a fuzzy attractive domain, in which a global solution is
obtained so as to achieve the desired stability condition of the closed-loop
system. In the control of chaotic systems, we use two phases of control:
the first phase uses an open-loop control that exploits the inherent chaotic
features of the system itself, and the second phase employs a fuzzy
model-based controller under state feedback control. The Henon map is
employed to illustrate the above design procedure.


Keywords: Chaotic systems, Fuzzy model-based control, Evolutionary computation,
Nonlinear dynamics, System stability, Lyapunov function.


1  Introduction 

Recently, the development of chaos theory has brought scientists into a new era in
analyzing nonlinear systems. It is known that chaos exhibits deterministic
random behavior. Yet such nonlinear systems need more investigation when
designing control algorithms. Edward Lorenz, the first experimenter in chaos,
was a meteorologist who described his model of weather prediction phenomena
with a set of nonlinear differential equations in 1963. Among the various methods
available to analyze such nonlinear systems are fixed point analysis,
linearization, the Poincaré map, the Lyapunov method, spectral analysis, fractals, chaos,
etc. [1],[2],[8]. Employing intended deterministic chaos to control nonlinear
dynamical systems is an interesting development in control theory [1]. In
this method, it is proposed that in a nonlinear dynamic system, a chaotic
attraction can be formed by an appropriate use of open-loop control until the
system states converge within a specified area of an attractive domain, and then
a state feedback control can be used to drive the system states to the desired
values. In this approach, the use of state feedback control involves finding the
largest level curve based on a Lyapunov stability condition in advance. However,
this optimization problem will generally have multiple local minima. Therefore,
to find a global minimum, almost all such local minima have to be identified
by trial and error, which makes this approach complex.

Recent advances in LMI (Linear Matrix Inequality) theory [3],[4],[5],[10]
have made it possible to handle nonlinear control system problems via semi-definite
programming. For example, the stability condition is guaranteed by the well-known
Lyapunov approach [7] for fuzzy model-regulators, and the LMI tool searches the
solution space subject to various constraints. In particular, the Takagi-Sugeno
(T-S) fuzzy model plays an important role in designing such fuzzy regulators
[4],[5],[7]. In fact, fuzzy model-based control (FMC) can be applied very well
to nonlinear dynamic systems, which also implies the ability to control
chaotic systems via FMC.

In the control of chaotic systems, there are two phases of control. The first phase
uses an open-loop control such that the inherent chaotic features attract the
states to a desired area, as studied in [1]. Once the system has entered a specified
area, the open-loop control is cut off and the second phase of control is adopted. In
the proposed second phase, an FMC is employed under state feedback control. It
is implemented by employing a set of IF-THEN fuzzy rules and assuring
the global stability of the domain constructed by the total FMC. For this purpose,
the state feedback gain scheduling of the control system in the second phase is
achieved by solving a set of LMIs via an optimization technique based on
evolutionary computation. The rest of the paper is organized as follows. In section 2,
dynamic system modeling and FMC are reviewed. The proposed
fuzzy-chaos hybrid control scheme is presented in section 3. Finally, in section 4,
the chaotic system known as the Henon map is used to illustrate the design procedure
and a simulation result of the fuzzy model-based regulator is presented.

2  System  Modeling 

Inherent chaotic characteristics can be useful in moving a system to various
points in state space. In the proposed method, this feature is used to drive
the system states to a pre-defined domain C. An appropriate open-loop control
(OLC) input can be employed to create chaos or to use the chaos in a nonlinear
system itself. Once the system reaches the pre-defined fuzzy attractive domain, a
fuzzy model-based controller (FMC) is employed under state feedback control
to achieve the desired target. This design concept is shown schematically in Fig. 1 for
a two-dimensional case. Here it is intended to drive the system state from P1 to P3.
The feedback controller design is based on multiple linearizations around a single
equilibrium point, i.e., so-called off-equilibrium linearizations. It is known [7]
that the method will significantly improve the transient dynamics of the control
system for a general control problem. Moreover, it is interesting to note that such
a technique is useful for constructing a globally stable fuzzy attraction domain
without trial and error.


Fig.  1.  Fuzzy-chaos  hybrid  control  scheme 


2.1  Off-equilibrium  Linearization

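Nonlinear dynamic continuous time systems (CS) can be described by nonlinear
differential equations [1],[2] (or difference equations for discrete time systems
(DS)) as

$$\begin{aligned} \dot{x} &= F(x, u) && \text{for CS} \\ x(t+1) &= F(x, u) && \text{for DS} \end{aligned} \tag{1}$$

where $x \in R^n$ is the state vector and $u \in R^m$ is the control input vector of the
systems. The equilibrium points $(\bar{x}, \bar{u})$ (or fixed points) of the dynamic system
satisfy

$$\begin{aligned} \mathcal{E} &= \{ (\bar{x}, \bar{u}) \in R^{n+m} \mid F(\bar{x}, \bar{u}) = 0 \} && \text{for CS} \\ \mathcal{E} &= \{ (\bar{x}, \bar{u}) \in R^{n+m} \mid \bar{x} = F(\bar{x}, \bar{u}) \} && \text{for DS} \end{aligned} \tag{2}$$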

In this paper, it is proposed to select a suitable set of off-equilibrium points
such that all the subsystems compose a convex region C which keeps the
equilibrium point $(\bar{x}, \bar{u}) \in C$. More generally, an n-dimensional state space system
needs at least n + 1 off-equilibrium points which represent the convex region to
ensure the stability of a particular subsystem.

Neglecting higher order terms, we obtain a linearized model around any arbitrary
point $(x_0, u_0) \in C$ as follows:

$$\begin{aligned} \dot{x} &= A_0 (x - x_0) + B_0 (u - u_0) + F(x_0, u_0) && \text{for CS} \\ x(t+1) &= A_0 (x - x_0) + B_0 (u - u_0) + F(x_0, u_0) && \text{for DS} \end{aligned} \tag{3}$$

where

$$A_0 = \left. \frac{\partial F}{\partial x} \right|_{(x_0, u_0)}, \qquad B_0 = \left. \frac{\partial F}{\partial u} \right|_{(x_0, u_0)}$$


For example, a two-dimensional state space model needs three off-equilibrium
points such that the equilibrium point lies at the center of mass of an equilateral
triangle whose corners are the three off-equilibrium points.

2.2  Fuzzy  Models  and  Regulators 

The dynamics of the nonlinear system are approximated near an arbitrary point
$(x_0, u_0) \in C$. Then, equation (3) can be rewritten in the form:

$$\begin{aligned} \dot{x} &= A_0 x + B_0 u + d_0 && \text{for CS} \\ x(t+1) &= A_0 x + B_0 u + d_0 && \text{for DS} \end{aligned} \tag{4}$$

where $d_0 = F(x_0, u_0) - A_0 x_0 - B_0 u_0$. Note here that an arbitrary point $(x_0, u_0)$
need not be an equilibrium point $(\bar{x}, \bar{u})$.

Fuzzy models due to Takagi-Sugeno consist of a set of IF-THEN rules [7] for
the above approximate systems. The ith plant rule, for both
continuous-time and discrete-time fuzzy systems, is given by

IF $z_1(t)$ is $M_{i1}$ and ... and $z_p(t)$ is $M_{ip}$
THEN

$$\begin{cases} \dot{x}(t) = A_i x(t) + B_i u(t) + d_i & \text{for CS} \\ x(t+1) = A_i x(t) + B_i u(t) + d_i & \text{for DS} \\ y(t) = C_i x(t) \end{cases} \qquad i = 1, \ldots, r \tag{5}$$

where r is the number of fuzzy rules and $M_{ij}$ ($i = 1, \ldots, r$ and $j = 1, \ldots, p$) are
the fuzzy sets. The state vector is $x(t) \in R^n$, the input vector is $u(t) \in R^m$ and the
output vector is $y(t) \in R^q$. $A_i$, $B_i$ and $C_i$ are the system parameter
matrices and $d_i$ is the offset term of the ith fuzzy model. For a given state,
$z_1(t), \ldots, z_p(t)$ are the premise variables (or antecedent inputs).

Following the parallel distributed compensation approach, we can design the
following fuzzy regulators:

Regulator Rule i:

IF $z_1(t)$ is $M_{i1}$ and ... and $z_p(t)$ is $M_{ip}$
THEN $u(t) = -K_i [x(t) - x_r] + u_r$, $\quad i = 1, \ldots, r$

for the fuzzy models (5), where $x_r$ is a state reference trajectory, $u_r$ is the
corresponding input trajectory, and $K_i$ is the local feedback gain matrix. Thus,
the fuzzy regulator rules have linear state-feedback laws in the consequent parts
and the overall fuzzy regulator can be reduced to

$$u(t) = -\sum_{i=1}^{r} h_i(z(t)) K_i [x(t) - x_r] + u_r \tag{6}$$

where

$$z(t) = [z_1(t), \ldots, z_p(t)], \qquad w_i(z(t)) = \prod_{j=1}^{p} M_{ij}(z_j(t)), \qquad h_i(z(t)) = \frac{w_i(z(t))}{\sum_{i=1}^{r} w_i(z(t))}$$

for all i, in which $M_{ij}(z_j(t))$ denotes the confidence (or grade) of membership
of $z_j(t)$ in $M_{ij}$.
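The blending in (6) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the Gaussian membership shape, the helper names (`gaussian`, `rule_weights`, `normalize`, `pdc_control`) and the numbers in the usage note are hypothetical, since this section does not specify the membership functions.

```python
import math

def gaussian(z, center, width):
    """Hypothetical membership grade M_ij(z_j); the shape is an assumption."""
    return math.exp(-((z - center) / width) ** 2)

def rule_weights(z, sets):
    """w_i(z) = prod_j M_ij(z_j); sets[i][j] = (center, width) for rule i, premise j."""
    return [math.prod(gaussian(zj, c, s) for zj, (c, s) in zip(z, rule))
            for rule in sets]

def normalize(w):
    """h_i = w_i / sum_i w_i (only defined when some rule fires)."""
    total = sum(w)
    return [wi / total for wi in w]

def pdc_control(x, x_r, u_r, gains, h):
    """Single-input version of (6): u = -sum_i h_i K_i (x - x_r) + u_r."""
    e = [xi - xri for xi, xri in zip(x, x_r)]
    return u_r - sum(hi * sum(k * ej for k, ej in zip(K, e))
                     for hi, K in zip(h, gains))
```

For instance, with two rules that fire equally (`h = [0.5, 0.5]`), gains `[[1, 2], [3, 4]]`, error `x - x_r = (1, 1)` and `u_r = 0`, the blended input is the average of the two local laws.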


3  Fuzzy-Chaos  Hybrid  Controller 

In order to control the original nonlinear system with a chaotic input in the
open-loop system or a fuzzy controller in the closed-loop system, a fuzzy-chaos
hybrid control scheme is proposed here. Such a control scheme can be considered
in two cases, depending on the choice of equilibrium points as the reference.


3.1  Stabilization  of  a  Prespecified  Equilibrium  Point 

In this case, the fuzzy-chaos hybrid control can be implemented by

IF $\sum_{i=1}^{r} w_i(z(t)) = 0$ THEN
    $u(t) = \hat{u}(t)$
ELSE
    $u(t) = -\sum_{i=1}^{r} h_i(z(t)) K_i [x(t) - \bar{x}] + \bar{u}$    (7)

where $\hat{u}(t)$ is an open-loop input to make the original nonlinear system chaotic
and $(\bar{x}, \bar{u})$ is the prespecified equilibrium point which is to be stabilized.


3.2  Stabilization  of  any  Equilibrium  Point 

If the equilibrium point to be stabilized may be any one of the equilibrium points,
the above fuzzy-chaos hybrid control can be modified as follows.

IF $\sum_{i=1}^{r} w_i(z(t)) = 0$ THEN
    $u(t) = \hat{u}(t)$
ELSE
    $i_{\max} = \arg\max_i \{ h_1(z(t)), \ldots, h_r(z(t)) \}$
    $u(t) = -K_{i_{\max}} [x(t) - \bar{x}_{i_{\max}}] + \bar{u}_{i_{\max}}$    (8)

where $i_{\max}$ denotes the rule number that has the largest rule confidence, $K_{i_{\max}}$ is the
corresponding feedback gain matrix, and $(\bar{x}_{i_{\max}}, \bar{u}_{i_{\max}})$ is an equilibrium point
existing in the fuzzy attractive domain constructed by using the $i_{\max}$-th rule.

Gain scheduling of the feedback controller is determined by employing a set
of LMIs [3]. An eigenvalue minimization algorithm was developed to determine
the positive definite and positive semi-definite matrices associated with the various
linear matrix inequalities, using an evolutionary computation technique that
penalizes any individual which violates an inequality condition. This
optimization problem can also be solved efficiently by means of recently developed
interior-point methods [5].

4  Design  Example  and  Results 

4.1  Henon  Map 

In this example, the chaotic system known as the Henon map is used to illustrate the
proposed design procedure. The nonlinear dynamic equations of the Henon map
are given by

$$\begin{aligned} x_1(t+1) &= -1.4 x_1^2(t) + x_2(t) + 1 \\ x_2(t+1) &= 0.3 x_1(t) \end{aligned} \tag{9}$$

The fixed points of the system of difference equations (9) satisfy equation
(2), which yields the two fixed points (0.6314, 0.1894) and (-1.1314, -0.3394). Therefore,
we can design two convex regions which correspond to the two fixed points.
Here we select a triple-point sub-region ($A_i x(t) + B_i u(t)$ for $i = 1, 2, 3$) such
that it surrounds the fixed point ($\bar{x}_1 = 0.6314$, $\bar{x}_2 = 0.1894$) as shown in Fig. 2.
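The two fixed points quoted above can be reproduced directly from (9) and the discrete-time condition in (2). The sketch below is a minimal check; the function names are illustrative and not taken from the paper.

```python
import math

A_PARAM, B_PARAM = 1.4, 0.3  # coefficients of the Henon map in (9)

def henon(x1, x2, u=0.0):
    """One step of the Henon map; u is an optional additive input on x1."""
    return -A_PARAM * x1 * x1 + x2 + 1.0 + u, B_PARAM * x1

def henon_fixed_points():
    """Solve x = F(x): with x2 = b*x1, a*x1^2 + (1 - b)*x1 - 1 = 0."""
    disc = math.sqrt((1.0 - B_PARAM) ** 2 + 4.0 * A_PARAM)
    roots = [(-(1.0 - B_PARAM) + sign * disc) / (2.0 * A_PARAM)
             for sign in (1.0, -1.0)]
    return [(r, B_PARAM * r) for r in roots]
```

Rounded to four decimals, `henon_fixed_points()` gives (0.6314, 0.1894) and (-1.1314, -0.3394), in agreement with the values in the text.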

The equilibrium point lies at the center of mass of an equilateral triangle having
the corner coordinates $(x_{a1}, x_{b1})$, $(x_{a2}, x_{b2})$ and $(x_{a3}, x_{b3})$ in order to determine the
common P and common Q. The open-loop control input u to the system (9) is
introduced as in equation (10) in implementing the desired control system.

$$\begin{aligned} x_1(t+1) &= -1.4 x_1^2(t) + x_2(t) + 1 + u \\ x_2(t+1) &= 0.3 x_1(t) \end{aligned} \tag{10}$$

Three linearized models corresponding to the first fixed point are given below,
taking $(x_{a3} - x_{a1}) = 0.2$.

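$$A_1 = \begin{bmatrix} -1.4878 & 1 \\ 0.3 & 0 \end{bmatrix}, \quad B_1 = \begin{bmatrix} 1 \\ 0 \end{bmatrix}, \qquad A_2 = \begin{bmatrix} -1.7678 & 1 \\ 0.3 & 0 \end{bmatrix}, \quad B_2 = \begin{bmatrix} 1 \\ 0 \end{bmatrix}, \qquad A_3 = \begin{bmatrix} -2.0478 & 1 \\ 0.3 & 0 \end{bmatrix}, \quad B_3 = \begin{bmatrix} 1 \\ 0 \end{bmatrix}$$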
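The entries of these matrices follow from the Jacobian of (10). As a hedged sketch, the helper below (a hypothetical name, not from the paper) evaluates the linearization (3)-(4) of the Henon map at an arbitrary point; the only state-dependent entry is $-2.8\,x_1$, so the corner abscissas are assumed here to be $\bar{x}_1 - 0.1$, $\bar{x}_1$ and $\bar{x}_1 + 0.1$, which is consistent with the printed values.

```python
import math

def henon_linearization(x1_0, x2_0, u_0=0.0, a=1.4, b=0.3):
    """Return (A0, B0, d0) of (4) for the controlled Henon map (10) at (x0, u0)."""
    A = [[-2.0 * a * x1_0, 1.0],   # d/dx of -a*x1^2 + x2 + 1 + u
         [b, 0.0]]                 # d/dx of b*x1
    B = [1.0, 0.0]                 # u enters the first equation only
    f = (-a * x1_0 ** 2 + x2_0 + 1.0 + u_0, b * x1_0)
    # Offset term d0 = F(x0, u0) - A0 x0 - B0 u0
    d = [f[r] - (A[r][0] * x1_0 + A[r][1] * x2_0) - B[r] * u_0 for r in range(2)]
    return A, B, d

# First fixed point of the uncontrolled map
X_BAR1 = (math.sqrt(6.09) - 0.7) / 2.8
# Assumed abscissas of the three off-equilibrium corners
CORNERS_X1 = [X_BAR1 - 0.1, X_BAR1, X_BAR1 + 0.1]
```

Evaluating `henon_linearization(x1, 0.0)[0][0][0]` for the three assumed abscissas gives -1.4878, -1.7678 and -2.0478 after rounding. The second coordinate of each corner does not affect $A_i$ or $B_i$, so a placeholder value is used for it.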
Fig. 2. Triple point sub-system (i = 1, 2, 3) and its membership functions

The same procedure is repeated to select the next sub-region ($A_i x(t) + B_i u(t)$ for $i =
4, 5, 6$) around the second fixed point, and the three linearized models are given
below.


4.2  Calculation  of  Feedback  Gains 

Gain scheduling of the above problem can be formulated as an optimization
problem with the LMIs [3], and it is solved by using an optimization technique
based on evolutionary computation [9]. Here we obtain the common P and common
Q for the first fixed point, guaranteeing the stability. We obtained $P_1$
and $Q_1$ as follows at $\beta = 0.6020$ (s = 3):

$$P_1 = \begin{bmatrix} 292.519 & 39.7689 \\ 39.7689 & 535.242 \end{bmatrix}, \qquad Q_1 = \begin{bmatrix} 13.8323 & 15.8108 \\ 15.8108 & 142.855 \end{bmatrix}$$

Similarly, the $P_2$ and $Q_2$ matrices associated with the second fixed point can be
obtained as follows at $\beta = 0.9581$ (s = 3):

$$P_2 = \begin{bmatrix} 633.277 & 29.3221 \\ 29.3221 & 656.3649 \end{bmatrix}, \qquad Q_2 = \begin{bmatrix} 55.4408 & 30.5980 \\ 30.5980 & 76.9469 \end{bmatrix}$$

The gains $K_1$, $K_2$ and $K_3$ are obtained as below, providing the global stability
of the fuzzy control system for the first sub-region,

K1 = [-0.mS  1.0329],  K2 = [-1.2171  0.7717]
K3 = [-2.1002  0.5644]

Similarly, the gains $K_4$, $K_5$ and $K_6$ are obtained for the second sub-region as
follows:

K4 = [3.4272  0.7360],  K5 = [3.8298  1.5356]
K6 = [2.6047  0.8615]
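The reported numbers can be sanity-checked with a few lines of arithmetic. The sketch below is not the relaxed LMI condition used by the authors; it only checks, as an assumption-labelled illustration, that the reported P and Q matrices are positive definite and that the local closed-loop matrices $G_i = A_i - B_i K_i$ for rules 2 and 3 are Schur stable and satisfy the plain quadratic decrease condition $G_i^T P_1 G_i - P_1 < 0$. Rule 1 is omitted because the first entry of $K_1$ is not legible in this copy. All helper names are hypothetical.

```python
import cmath
import math

P1 = [[292.519, 39.7689], [39.7689, 535.242]]
Q1 = [[13.8323, 15.8108], [15.8108, 142.855]]
P2 = [[633.277, 29.3221], [29.3221, 656.3649]]
Q2 = [[55.4408, 30.5980], [30.5980, 76.9469]]

B = [1.0, 0.0]
# rule index -> (A_i, K_i), values copied from the text for the legible gains
MODELS = {
    2: ([[-1.7678, 1.0], [0.3, 0.0]], [-1.2171, 0.7717]),
    3: ([[-2.0478, 1.0], [0.3, 0.0]], [-2.1002, 0.5644]),
}

def sym_eigs(m):
    """Eigenvalues (min, max) of a symmetric 2x2 matrix in closed form."""
    mean = 0.5 * (m[0][0] + m[1][1])
    radius = math.hypot(0.5 * (m[0][0] - m[1][1]), m[0][1])
    return mean - radius, mean + radius

def closed_loop(A, B, K):
    """G = A - B K for a single-input, two-state model."""
    return [[A[r][c] - B[r] * K[c] for c in range(2)] for r in range(2)]

def spectral_radius(G):
    """Largest eigenvalue magnitude of a 2x2 matrix."""
    tr = G[0][0] + G[1][1]
    det = G[0][0] * G[1][1] - G[0][1] * G[1][0]
    root = cmath.sqrt(tr * tr - 4.0 * det)
    return max(abs((tr + root) / 2.0), abs((tr - root) / 2.0))

def lyapunov_difference(G, P):
    """G^T P G - P; negative definite means V(x) = x^T P x decreases."""
    PG = [[sum(P[r][k] * G[k][c] for k in range(2)) for c in range(2)]
          for r in range(2)]
    GtPG = [[sum(G[k][r] * PG[k][c] for k in range(2)) for c in range(2)]
            for r in range(2)]
    return [[GtPG[r][c] - P[r][c] for c in range(2)] for r in range(2)]
```

Under the sign convention $u = -K_i(x - \bar{x})$ of (6), both legible rules give a spectral radius below one, which supports reading the control law with the minus sign.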

Based on the methodology proposed in section 2, the Henon map is controlled
by a fuzzy-chaos hybrid controller. Since the system is already chaotic,
it can be used for the first phase of control with no input (u = 0). Once the
system states reach one of the above two sub-domains, the fuzzy controller will
drive the system towards the fixed point. For example, by using the above gains
$K_i$ (i = 1, 2, 3), the fuzzy controller is constructed from the following IF-THEN
rule base:

R1 : IF x1 is PS AND x2 is PS
     THEN K = K1
R2 : IF x1 is P AND x2 is PB
     THEN K = K2                              (11)
R3 : IF x1 is PB AND x2 is PS
     THEN K = K3

where PS, P and PB represent the words positive small, positive and positive
big respectively. In order to verify the above design procedure in sections 2 and
3, the proposed fuzzy-chaos hybrid controller is applied to the chaotic system
(9). Here we allow the system to drive its states towards one of the above two
fixed points chaotically. The rule base (8) was employed here to construct the
controller as follows:

IF $\sum_{i=1}^{6} w_i(z(t)) = 0$ THEN
    $u = \hat{u} = 0$
ELSE
    $u = K_{i_{\max}} [\bar{x}_{i_{\max}} - x]$    (i = 1, ..., 6)    (12)

where $K_{i_{\max}}$ is the gain which corresponds to rule $i_{\max}$ (i = 1, ..., 6). Figure 3
shows the resulting trajectory of the chaotic system controlled by the proposed
controller starting from (-0.3, 0).
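The two-phase idea can be illustrated with a deliberately simplified simulation. The sketch below is not the authors' controller: it replaces the fuzzy attractive domain by a crisp square of half-width 0.1 around the first fixed point (an assumption), and it uses only the single gain $K_2$ reported for the model linearized at that fixed point, with the control law $u = -K_2(x - \bar{x})$. Outside the square the map runs freely with $u = 0$.

```python
import math

X_BAR1 = (math.sqrt(6.09) - 0.7) / 2.8   # first fixed point of the Henon map
X_BAR = (X_BAR1, 0.3 * X_BAR1)
K2 = (-1.2171, 0.7717)                   # gain reported for rule 2
HALF_WIDTH = 0.1                         # assumed size of the capture region

def simulate(x0=(-0.3, 0.0), steps=20000):
    """Phase 1: free chaotic motion. Phase 2: state feedback inside the region."""
    x1, x2 = x0
    switched_at = None
    for t in range(steps):
        e1, e2 = x1 - X_BAR[0], x2 - X_BAR[1]
        if max(abs(e1), abs(e2)) <= HALF_WIDTH:
            if switched_at is None:
                switched_at = t          # first step at which feedback is applied
            u = -(K2[0] * e1 + K2[1] * e2)
        else:
            u = 0.0                      # open-loop phase with no input
        x1, x2 = -1.4 * x1 * x1 + x2 + 1.0 + u, 0.3 * x1
    return (x1, x2), switched_at
```

Calling `simulate()` from the initial state (-0.3, 0) used in the paper should end close to the fixed point once the chaotic orbit has wandered into the assumed capture region. The fuzzy blending of several gains over a triangular sub-region, as in the paper, is intentionally left out of this sketch.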

5  Conclusions 

Inherent chaotic features have been used to drive the system states to a predefined
domain using an OLC. Once the system reaches the predefined fuzzy attractive
domain, a fuzzy model-based controller is employed under state feedback control
to achieve the reference target. An eigenvalue minimization algorithm was used to
determine the positive definite and positive semi-definite matrices associated with
the various LMIs, using an evolutionary computation technique.

Fig. 3. Trajectory of the controlled Henon map starting from (-0.3, 0).

In the example, the Henon map was driven by the chaotic system itself, with no
control input, before reaching the fuzzy attractive domain. Once it reaches
one of the two fuzzy domains, a fuzzy model-based controller drives the system
towards the equilibrium point. The simulation result has shown the good tracking
performance of the proposed controller in spite of the uncertainties of the chaotic
system. Thus, the proposed methodology is useful for the design of nonlinear
control systems which exhibit deterministic random behavior.
Abstract. The behavior-based approach has been actively used in many
applications of intelligent robots due to the advantages of dividing the
control system according to task-achieving behaviors over the conventional
method in which the division is based on functions. One important
application that has been studied is for a mobile robot to reach a target
while avoiding obstacles. The objective of this paper is for a multi-link
manipulator to reach a target while avoiding obstacles by using a fuzzy
behavior-based control approach. The control system that was applied
to the mobile robot in the previous work is modified to suit
the manipulator. Fuzzy behavior elements are trained by a genetic
algorithm. An additional component is also introduced in order to overcome
the gravitational effect. Simulation results show that the manipulator
reaches the target with an acceptable solution.


Keywords: Manipulator, Fuzzy control, Behavior-based control system, Obstacle
avoidance, Genetic algorithms


1  Introduction 

Brooks [1] proposed a new architecture called "Subsumption Architecture" for
controlling a mobile robot. Layers of the control system were built to let the robot
operate at increasing levels of competence. Decomposition of the control system
was based on task-achieving behaviors. This behavior-based control has been
actively applied to several intelligent robots [2-6]. Watanabe and Izumi [5] studied
a fuzzy behavior-based control system for a mobile robot by applying the
concept of a subsumption-like architecture using soft computing techniques, in
which a simple fuzzy reasoning was assigned to one elemental behavior consisting
of a single input-output relation, and then two consequent results from two
behavioral groups were made to compete or cooperate.

On the other hand, Rahmanian-Shahri and Troch [7] presented a new method
for on-line collision avoidance of redundant manipulators with obstacles. Ding
and Chang [8] introduced a real-time planning algorithm for obstacle avoidance of
redundant robots in collision-free trajectory planning. Nearchou and Aspragathos
[9] presented an algorithm for Cartesian trajectory generation by redundant robots in
environments with obstacles. It should be noted that all the above mentioned works
are based on trajectory planning while avoiding obstacles, but none of them is
based on a behavior-based control strategy.

A fuzzy behavior-based control method developed in references [4] and [5]
is used in this work to control a multi-link manipulator to reach a target while
avoiding obstacles. The basic concept used for the mobile robot [4, 5]
is applied with some modifications. Thus, a fuzzy behavior-based control system
is applied to a three-degree-of-freedom, three-link manipulator to reach a target
from a given point while avoiding obstacles.

2  Three-Link  Manipulator 

It is assumed that the robot has three degrees of freedom and that it moves in a
two-dimensional vertical plane. The axes are selected in such a way that O-X-Y is
the vertical plane, where gravity acts opposite to the O-Y axis
and O-Z is selected according to the right hand rule. Let $[f_x\ f_y\ f_z]^T$ and
$[n_x\ n_y\ n_z]^T$ be the force vector and the moment vector at the end-effector of the
robot, where the subscripts x, y and z are used to represent the O-X axis, O-Y
axis and O-Z axis respectively. Since the robot is moving in the O-X-Y plane,
$f_z$, $n_x$ and $n_y$ are zero. Link coordinate axes $O_i$-$X_i$-$Y_i$ are defined in such a
way that the origin of each link coordinate system is selected at the end of the
respective link and the $O_i$-$X_i$ axis is selected along the link. The $O_i$-$Y_i$ axis is
perpendicular to the $O_i$-$X_i$ axis in the counterclockwise direction and the $O_i$-$Z_i$
axis is selected according to the right hand rule.

Length, joint angle and mass for link i are denoted by $l_i$, $\theta_i$ and $m_i$ respectively.
$\bar{x}_i$ and $I_{ixx}$, $\bar{y}_i$ and $I_{iyy}$, and $\bar{z}_i$ and $I_{izz}$ are the center of gravity and the
moment of inertia for link i in the $O_i$-$X_i$, $O_i$-$Y_i$ and $O_i$-$Z_i$ directions respectively.
It is assumed that all links are homogeneous, that the center of gravity acts
in the middle, and that each link is symmetrical about its center of gravity. Therefore
$\bar{x}_i = -l_i/2$, $\bar{y}_i = \bar{z}_i = 0.0$, $I_{ixx} = 0.0$, and $I_{iyy} = I_{izz} = m_i l_i^2 / 12$.

To simulate this model on a computer, dynamic equations for this manipulator
are derived using the Newton-Euler method [11]. The dynamic equations are
given by Eq. (1) in matrix-vector form:

$$\tau(t) = D(\theta(t))\,\ddot{\theta}(t) + h(\theta(t), \dot{\theta}(t)) + g(\theta(t)) \tag{1}$$

where $D(\theta(t))$ is the 3x3 inertia matrix; $h(\theta(t), \dot{\theta}(t))$ is the 3x1 Coriolis and
centrifugal force vector; and $g(\theta(t))$ is the 3x1 gravitational vector.
3  Behavior Model for the Manipulator

Fig. 1. Behavior-based control system for the manipulator (recovered block label: Objective Behavior Group)


Figure 1 shows the behavior-based control system consisting of three behavior
groups for the manipulator, with the higher level behavior group shown above
the lower behavior group. Inputs for the objective behavior group are $D_x$, $D_y$
and $\phi$, where $D_x$ and $D_y$ are the distances between the end-effector point and
the target in the O-X direction and O-Y direction respectively, and $\phi$ is the
relative angle between the moving direction of the end-effector point and the
objective point. Inputs to the free behavior group are $V_x$, $V_y$ and $\omega_z$, where $V_x$
and $V_y$ are the velocities at the end-effector point in the O-X direction and
O-Y direction respectively, and $\omega_z$ is the angular velocity of the end-effector
point. $V_x$, $V_y$ and $\omega_z$ must be calculated by using the Jacobian matrix with the
angular velocities of link 1, link 2 and link 3 ($\dot{\theta}_1$, $\dot{\theta}_2$, $\dot{\theta}_3$). All the obstacles are
assumed to be circles in the O-X-Y plane and they are sensed by a CCD
camera. It is assumed in this work that the camera vision system can process
its data and that the center coordinates and the radius of each obstacle are known.
The center coordinates and radius of the j-th obstacle are denoted by $(x_{cj}, y_{cj})$
and $r_j$ respectively. Together with the obstacle data, if the angular positions
of link 1, link 2 and link 3 ($\theta_1$, $\theta_2$, $\theta_3$) are known, $S_{ij}$, the minimum distance
from each link i to the center of every obstacle j, can be calculated. Once all $S_{ij}$ values are
calculated, the respective $d_{ij}$, the minimum distance from link i to obstacle j, can
be obtained by $d_{ij} = S_{ij} - r_j$. Thereafter d, the minimum of all the minimum
distances from each link to every obstacle, can be found, and after that
$\psi$, the relative angle between the respective link and the vector d measured in the
counterclockwise direction, and $\delta$, the angle of the distance vector d with respect
to the O-X axis of the base coordinate system, are calculated. The inputs to
the reactive behavior group are given by $d_x$, $d_y$ and $\psi$, where $d_x = d\cos(\delta)$ and
$d_y = d\sin(\delta)$. For each behavior group the output variables are the force required
in the O-X direction, the force required in the O-Y direction and the moment in the
O-Z direction, in that order. Their output vectors are represented by
$[f_{xo}\ f_{yo}\ n_{zo}]^T$, $[f_{xf}\ f_{yf}\ n_{zf}]^T$ and $[f_{xr}\ f_{yr}\ n_{zr}]^T$ for the objective behavior group,
free behavior group and reactive behavior group respectively. Forces and moment
in the absolute coordinate system, $F = [f_x\ f_y\ n_z]^T$, are obtained by fusing the
reasoning results generated from each behavior group through the nonlinear
suppression unit with s [4, 5]. Fuzzy rule relations and the fusion of behavior groups
are explained later in Section 3.2. To transform the above quantity to the torque
in the joint coordinate system, the following Jacobian transpose is used [10]:

$$\tau^* = J^T F \tag{2}$$

where $\tau^* = [\tau_1^*\ \tau_2^*\ \tau_3^*]^T$ is the torque vector obtained as if there were no
gravitational effects. When considering the additional gravitational torque vector
$\Delta\tau = [\Delta\tau_1\ \Delta\tau_2\ \Delta\tau_3]^T$, which is explained in Section 3.1, the final output torque
vector $\tau = [\tau_1\ \tau_2\ \tau_3]^T$ is given by

$$\tau = \tau^* + \Delta\tau \tag{3}$$
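The Jacobian transpose mapping in (2) can be sketched for a planar three-link arm as follows. This is a minimal illustration using the textbook planar Jacobian, not the authors' code; the function names are hypothetical, and the link lengths in the usage note are only sample values.

```python
import math

def planar_jacobian(theta, lengths):
    """3x3 Jacobian of (x, y, orientation) of the end-effector of a planar 3R arm."""
    t1, t2, t3 = theta
    l1, l2, l3 = lengths
    a1, a12, a123 = t1, t1 + t2, t1 + t2 + t3
    s = (l1 * math.sin(a1), l2 * math.sin(a12), l3 * math.sin(a123))
    c = (l1 * math.cos(a1), l2 * math.cos(a12), l3 * math.cos(a123))
    return [
        [-(s[0] + s[1] + s[2]), -(s[1] + s[2]), -s[2]],  # dx / dtheta_k
        [c[0] + c[1] + c[2], c[1] + c[2], c[2]],         # dy / dtheta_k
        [1.0, 1.0, 1.0],                                 # dphi / dtheta_k
    ]

def joint_torques(theta, lengths, wrench):
    """tau* = J^T F with F = [f_x, f_y, n_z]."""
    J = planar_jacobian(theta, lengths)
    return [sum(J[r][k] * wrench[r] for r in range(3)) for k in range(3)]
```

For a fully stretched horizontal arm (all joint angles zero) with link lengths 0.5, 0.4 and 0.3 m, a unit force in the O-Y direction yields joint torques equal to the lever arms 1.2, 0.7 and 0.3.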

3.1  Compensation  of  the  Gravitational  Effect 

Each behavior group controls the movements of the robot. For example, in the
objective behavior group, the first input element $D_x$ must be reduced in order
to reach the target. This is obtained by controlling the output element $f_x$. Since
none of the behaviors considers the gravitational effect, the torque required
to overcome gravity is added to the output torque from the fuzzy
behavior-based control before it is applied to the manipulator, and it is assumed in
this work that the length, mass and center of gravity of the links are available.
The required torque equations for this calculation are given by

$$\Delta\tau_1 = m_1 g (l_1 + \bar{x}_1) c_1 + m_2 g [l_1 c_1 + (l_2 + \bar{x}_2) c_{12}] + m_3 g [l_1 c_1 + l_2 c_{12} + (l_3 + \bar{x}_3) c_{123}] \tag{4}$$

$$\Delta\tau_2 = m_2 g (l_2 + \bar{x}_2) c_{12} + m_3 g [l_2 c_{12} + (l_3 + \bar{x}_3) c_{123}] \tag{5}$$

$$\Delta\tau_3 = m_3 g (l_3 + \bar{x}_3) c_{123} \tag{6}$$

where $c_1 = \cos(\theta_1)$, $c_{12} = \cos(\theta_1 + \theta_2)$, $c_{123} = \cos(\theta_1 + \theta_2 + \theta_3)$, and g is the
gravitational acceleration. However, for a robot moving in a horizontal plane this
additional torque is not required.
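The gravity compensation terms (4)-(6) translate directly into code. The sketch below assumes homogeneous links with the center of gravity at the midpoint, expressed in the link frame as $\bar{x}_i = -l_i/2$ so that $l_i + \bar{x}_i = l_i/2$; the function name and the masses in the usage note are hypothetical placeholders, not values from the paper.

```python
import math

def gravity_torques(theta, lengths, masses, g=9.81):
    """Additional joint torques (4)-(6) that hold a planar 3-link arm against gravity."""
    t1, t2, t3 = theta
    l1, l2, l3 = lengths
    m1, m2, m3 = masses
    # Distance from joint i to the center of gravity of link i: l_i + xbar_i = l_i / 2
    r1, r2, r3 = (l + (-l / 2.0) for l in lengths)
    c1 = math.cos(t1)
    c12 = math.cos(t1 + t2)
    c123 = math.cos(t1 + t2 + t3)
    dt1 = (m1 * g * r1 * c1
           + m2 * g * (l1 * c1 + r2 * c12)
           + m3 * g * (l1 * c1 + l2 * c12 + r3 * c123))
    dt2 = m2 * g * r2 * c12 + m3 * g * (l2 * c12 + r3 * c123)
    dt3 = m3 * g * r3 * c123
    return [dt1, dt2, dt3]
```

As a quick plausibility check with unit masses and g = 10, a horizontal arm with lengths 0.5, 0.4 and 0.3 m needs torques of 20, 7.5 and 1.5 N·m, while an arm pointing straight up needs essentially none.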

3.2  Fuzzy  Rule  of  a  Behavior  Group 

A simple fuzzy reasoning is applied to one behavioral element using one piece of sensor
information or processed sensor information y, and it generates one reasoning result
u*. A Gaussian-type function is used as the membership function, together with the
simplified fuzzy reasoning. The resultant fuzzy reasoning consequent is obtained
as

$$u^* = \sum_{i=1}^{M} p_i w_{bi} \tag{7}$$

where M denotes the total number of rules, $w_{bi}$ the constant in the conclusion
of the i-th rule, and $p_i$ the normalized rule confidence such that

$$p_i = \frac{\mu_i}{\sum_{j=1}^{M} \mu_j} \tag{8}$$

$$\mu_i = \exp\{ \ln(0.5)\,(y - w_{ci})^2\, w_{di}^2 \} \tag{9}$$

where $w_{ci}$ denotes the center value associated with the i-th membership function,
and $w_{di}$ denotes the reciprocal value of the deviation from the center $w_{ci}$ at which
the i-th Gaussian function of the input data on the support set has value 0.5.
Consider two behavior groups i and i + 1, where the behavior group i represents
the lower behavior group and the behavior group i + 1 represents the higher behavior
group. Let the two corresponding outputs from the two behavior groups be denoted by a
and b, from behavior group i and behavior group i + 1 respectively. The fusion
result is given by c.

$$c = (1 - s)a + sb \tag{10}$$

with

$$s = |\mathrm{sat}(a) + \mathrm{sat}(b)| / 2 \tag{11}$$

Here the saturation function "sat" is given by

(12)

where $\varepsilon$ denotes a positive number.
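The reasoning of (7)-(9) and the fusion of (10)-(11) can be sketched as follows. Two points are assumptions rather than statements from the text: the normalization $p_i = \mu_i / \sum_j \mu_j$ is the usual simplified fuzzy reasoning, and, because the definition of "sat" in (12) is not legible in this copy, a standard unit saturation $\mathrm{sat}(x) = \max(-1, \min(1, x/\varepsilon))$ is used as a stand-in. All function names are hypothetical.

```python
import math

def membership(y, w_c, w_d):
    """Gaussian-type grade of (9): equals 0.5 at a distance 1/w_d from the center."""
    return math.exp(math.log(0.5) * (y - w_c) ** 2 * w_d ** 2)

def reason(y, rules):
    """Simplified fuzzy reasoning (7); rules is a list of (w_c, w_d, w_b) triples."""
    mu = [membership(y, w_c, w_d) for w_c, w_d, _ in rules]
    total = sum(mu)
    return sum(m / total * w_b for m, (_, _, w_b) in zip(mu, rules))

def sat(x, eps):
    """ASSUMED saturation: linear inside [-eps, eps], clipped to +/-1 outside."""
    return max(-1.0, min(1.0, x / eps))

def fuse(a, b, eps):
    """Fusion (10)-(11) of the lower-group output a and higher-group output b."""
    s = abs(sat(a, eps) + sat(b, eps)) / 2.0
    return (1.0 - s) * a + s * b
```

With this stand-in saturation, two strongly agreeing outputs give s = 1, so the higher behavior group's value b is passed through, whereas two strongly opposing outputs give s = 0 and the lower group's value a is kept.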
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4  Learning  Using  Genetic  Algorithm 

In tournament selection, the fittest of three individuals is selected as a parent.
Two-point crossover is used with a crossover probability of
0.6. One individual has 135 training parameters, because $n_p = N_f \times N_r \times N_e \times N_b$,
where $n_p$ denotes the number of training parameters, $N_f$ the number of fuzzy parameters
(3, namely $w_{ci}$, $w_{di}$, $w_{bi}$), $N_r$ the number of rules (5), $N_e$ the number of elements in one
group (3), and $N_b$ the number of behavior groups (3). Since one parameter is
represented by an 8-bit string, the code length is 1080 (i.e., 135 x 8). Each generation
contains 60 individuals: the 8 elite individuals are kept for
the next generation, and 52 parents are selected, which generate
52 offspring. The mutation rate is equal to 1/(code length), i.e., 1/1080.
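
As a rough sketch of how these settings fit together, the following Python fragment builds one generation step from the figures quoted above. The function names, the bit-list encoding, and the way offspring are produced in pairs are illustrative assumptions, not the authors' C implementation.

```python
import random

N_F, N_R, N_E, N_B = 3, 5, 3, 3      # fuzzy params, rules, elements, groups
N_PARAMS = N_F * N_R * N_E * N_B     # 135 training parameters
BITS = 8
CODE_LEN = N_PARAMS * BITS           # 1080 bits per individual
POP, ELITE = 60, 8
P_CROSS, P_MUT = 0.6, 1.0 / CODE_LEN

def tournament(pop, fit, rng, k=3):
    """Return the fittest of k randomly drawn individuals."""
    picks = rng.sample(range(len(pop)), k)
    return pop[max(picks, key=lambda i: fit[i])]

def two_point_crossover(p1, p2, rng):
    """Swap the segment between two cut points with probability P_CROSS."""
    if rng.random() >= P_CROSS:
        return p1[:], p2[:]
    i, j = sorted(rng.sample(range(1, CODE_LEN), 2))
    return p1[:i] + p2[i:j] + p1[j:], p2[:i] + p1[i:j] + p2[j:]

def mutate(bits, rng):
    """Flip each bit independently with probability P_MUT."""
    return [b ^ 1 if rng.random() < P_MUT else b for b in bits]

def next_generation(pop, fit, rng):
    """Keep the ELITE best unchanged and fill the rest with offspring."""
    order = sorted(range(len(pop)), key=lambda i: fit[i], reverse=True)
    new_pop = [pop[i][:] for i in order[:ELITE]]
    while len(new_pop) < POP:
        c1, c2 = two_point_crossover(tournament(pop, fit, rng),
                                     tournament(pop, fit, rng), rng)
        new_pop.extend([mutate(c1, rng), mutate(c2, rng)])
    return new_pop[:POP]
```

Here `fit` is a list of fitness values aligned with `pop`; in the setting of the paper it would come from simulating the manipulator with each decoded parameter set.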


4.1  Fitness  Function 

One individual run of the manipulator ends when any of the following occurs: one or
more links go out of range; any link collides with an obstacle; the given time is over;
the end-effector point moves 0.5 [m] away from the minimum distance between
the end-effector point and the target during the run; or the end-effector point
reaches the target successfully. The fitness value, Fitness, is given by

$$\mathrm{Fitness} = -d_{min} + (D_0 - d_{min})/T_D - 3.0 \times N_o - 0.1 \times N_c \qquad (13)$$

where $d_{min}$ is the minimum distance between the end-effector of the robot and
the target during the run, $D_0$ is the initial distance between the end-effector
and the target, and $T_D$ is the travel distance of the end-effector point during
that run. Here $N_o$ is the number of links that went out of range and
$N_c$ is the number of links that collided with an obstacle. The objective of
the optimization is to reach the target along a short, acceptable path
while avoiding the obstacles.
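
Eq. (13) translates directly into code. The sketch below uses hypothetical argument names and assumes the travel distance is positive.

```python
def fitness(d_min, d_0, travel, n_out, n_collided):
    """Eq. (13): reward approach and short travel, penalise violations.

    d_min: minimum end-effector/target distance during the run
    d_0: initial end-effector/target distance
    travel: distance travelled by the end-effector (assumed > 0)
    n_out, n_collided: links out of range / links that hit an obstacle
    """
    return -d_min + (d_0 - d_min) / travel - 3.0 * n_out - 0.1 * n_collided
```

Under this formula a run that reaches the target (d_min = 0) along a straight line (travel = d_0) without violations scores exactly 1, so fitness values slightly below 1 correspond to paths only somewhat longer than the direct one.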

5  Simulations 

5.1  Parameter  Settings 

The links have lengths of 0.5 [m], 0.4 [m] and 0.3 [m], masses of 0.48 [kg],
0.36 [kg] and 0.24 [kg], and joint ranges of -160 to +160, -135 to +135 and -110 to
+170 for link 1, link 2 and link 3, respectively. The manipulator dynamics,
given by Eq. (1), were modeled in C.
The sampling time interval is 0.05 [s], and the differential equations
are solved numerically by the Runge-Kutta-Gill method. The
maximum time allowed for one individual run is 15 [s]. Three obstacles are
considered and two simulations were carried out. In the first simulation the robot
is placed initially along the O-X axis (horizontally), and in the second simulation
it is placed initially along the O-Y axis (vertically). In each case, the fuzzy
parameters were trained for 1000 generations.
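
For readers unfamiliar with the Runge-Kutta-Gill method, a minimal Python sketch of one integration step with the standard Gill coefficients follows. It is a generic illustration for a system y' = f(t, y), not the authors' C code, and the manipulator dynamics of Eq. (1) are not reproduced here.

```python
import math

SQ = 1.0 / math.sqrt(2.0)

def rkg_step(f, t, y, h):
    """One Runge-Kutta-Gill step for y' = f(t, y); y is a list of floats."""
    k1 = [h * v for v in f(t, y)]
    k2 = [h * v for v in f(t + h / 2, [a + b / 2 for a, b in zip(y, k1)])]
    y3 = [a + (SQ - 0.5) * b + (1 - SQ) * c for a, b, c in zip(y, k1, k2)]
    k3 = [h * v for v in f(t + h / 2, y3)]
    y4 = [a - SQ * c + (1 + SQ) * d for a, c, d in zip(y, k2, k3)]
    k4 = [h * v for v in f(t + h, y4)]
    return [a + (b + (2 - math.sqrt(2)) * c + (2 + math.sqrt(2)) * d + e) / 6
            for a, b, c, d, e in zip(y, k1, k2, k3, k4)]

def integrate(f, y0, t_end, h=0.05):
    """Integrate from t = 0 to t_end with fixed step h (0.05 s in the text)."""
    t, y = 0.0, list(y0)
    for _ in range(round(t_end / h)):
        y = rkg_step(f, t, y, h)
        t += h
    return y
```

For example, `integrate(lambda t, y: [y[1], -y[0]], [1.0, 0.0], 1.0)` integrates a harmonic oscillator for one second with the 0.05 [s] step quoted above.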
5.2 Results

Figure 2 shows the end-effector point path for the individual with the best fitness,
whose value is 0.847344, after 1000 generations in simulation 1, and Fig. 3 shows
the corresponding best and average fitness values of each generation. Figures
4 and 5 present similar results for simulation 2, where the best fitness value is
0.968144. These results show that in both cases the robot managed to
reach the target with an acceptable solution. These simulations therefore indicate
that the present approach is useful for the task control of
multi-link manipulators while avoiding obstacles.


Fig. 2. End-effector point path with a best individual for simulation 1 (horizontal axis: X coordinate [m])


6  Conclusions 

The fuzzy behavior-based control strategy has been applied to controlling a
multi-link manipulator. The simulations showed that such an approach is
effective for complex manipulators to achieve certain tasks while avoiding obstacles.
This approach also has the advantage of moving towards a particular point
while avoiding obstacles without knowing the inverse kinematics. This means that
the manipulator can be controlled to achieve the desired task with online information
and a suitable fitness function, without going into the actual analytical details.
This is especially useful for robot manipulators because their kinematics and
dynamics are usually complex by nature, and the analysis
becomes more involved for redundant manipulators. The approach is
suitable not only when the relationships of the system dynamics are linear, but
also when they are nonlinear.
Abstract: Cash amounts and interest rates are usually estimated by
educated guesses based on expected values or other statistical techniques.
Fuzzy numbers can capture the difficulties in estimating these
parameters. The fuzzy equivalent uniform annual value and fuzzy future value
methods are examined with numeric examples in the paper. The paper also gives
ranking methods for fuzzy numbers.

1  Introduction 

To  deal  with  vagueness  of  human  thought,  Zadeh  [1]  first  introduced  the  fuzzy  set  theory, 
which  was  based  on  the  rationality  of  uncertainty  due  to  imprecision  or  vagueness.  A  major 
contribution  of  fuzzy  set  theory  is  its  capability  of  representing  vague  knowledge.  The 
theory  also  allows  mathematical  operators  and  programming  to  apply  to  the  fuzzy  domain. 

A fuzzy number is a normal and convex fuzzy set with membership function $\mu_A(x)$
which satisfies both normality, $\mu_A(x) = 1$ for at least one $x \in R$, and convexity,
$\mu_A(x') \ge \mu_A(x_1) \wedge \mu_A(x_2)$, where $\mu_A(x) \in [0,1]$ and $\forall x' \in [x_1, x_2]$. '$\wedge$' stands for
the minimization operator.

Quite  often  in  finance  future  cash  amounts  and  interest  rates  are  estimated.  One  usually 
employs  educated  guesses,  based  on  expected  values  or  other  statistical  techniques,  to 
obtain  future  cash  flows  and  interest  rates.  Statements  like  approximately  between  $  12,000 
and  $  16,000  or  approximately  between  10%  and  15%  must  be  translated  into  an  exact 
amount,  such  as  $  14,000  or  12.5%  respectively.  Appropriate  fuzzy  numbers  can  be  used  to 
capture  the  vagueness  of  those  statements. 

A tilde will be placed above a symbol if the symbol represents a fuzzy set. Therefore,
$\tilde P, \tilde F, \tilde G, \tilde A, \tilde i, \tilde r$ are all fuzzy sets. The membership functions for these fuzzy sets will be
denoted by $\mu(x \mid \tilde P)$, $\mu(x \mid \tilde F)$, $\mu(x \mid \tilde G)$, etc. A fuzzy number is a special fuzzy subset of the
real numbers. The extended operations of fuzzy numbers can be found in [11, 12]. A
triangular fuzzy number (TFN) is shown in Figure 1. The membership function of a TFN
$\tilde M$ is defined by

$$\mu(x \mid \tilde M) = (m_1, f_1(y \mid \tilde M) / m_2, m_2 / f_2(y \mid \tilde M), m_3) \qquad (1)$$

where $m_1 < m_2 < m_3$, $f_1(y \mid \tilde M)$ is a continuous monotone increasing function of $y$ for
$0 \le y \le 1$ with $f_1(0 \mid \tilde M) = m_1$ and $f_1(1 \mid \tilde M) = m_2$, and $f_2(y \mid \tilde M)$ is a continuous monotone
decreasing function of $y$ for $0 \le y \le 1$ with $f_2(1 \mid \tilde M) = m_2$ and $f_2(0 \mid \tilde M) = m_3$. $\mu(x \mid \tilde M)$ is
denoted simply as $(m_1 / m_2, m_2 / m_3)$.


Figure 1. A Triangular Fuzzy Number, $\tilde M$

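A flat fuzzy number (FFN) is shown in Figure 2. The membership function of an
FFN, $\tilde V$, is defined by

$$\mu(x \mid \tilde V) = (m_1, f_1(y \mid \tilde V) / m_2, m_3 / f_2(y \mid \tilde V), m_4) \qquad (2)$$

where $m_1 < m_2 < m_3 < m_4$, $f_1(y \mid \tilde V)$ is a continuous monotone increasing function of $y$ for
$0 \le y \le 1$ with $f_1(0 \mid \tilde V) = m_1$ and $f_1(1 \mid \tilde V) = m_2$, and $f_2(y \mid \tilde V)$ is a continuous monotone
decreasing function of $y$ for $0 \le y \le 1$ with $f_2(1 \mid \tilde V) = m_3$ and $f_2(0 \mid \tilde V) = m_4$. $\mu(x \mid \tilde V)$ is
denoted simply as $(m_1 / m_2, m_3 / m_4)$.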
The fuzzy sets $\tilde P, \tilde F, \tilde G, \tilde A, \tilde i, \tilde r$ are usually fuzzy numbers, but $\tilde n$ will be a discrete positive
fuzzy subset of the real numbers [2]. The membership function $\mu(x \mid \tilde n)$ is defined by a
collection of positive integers, where

$$\mu(x \mid \tilde n) = \begin{cases} \ldots \\ 0, & \text{otherwise} \end{cases}$$

2  Fuzzy  Future  Value  Method 

The future value (FV) of an investment alternative can be determined using the relationship

$$FV(r) = \sum_{t=0}^{n} P_t (1 + r)^{n-t}$$

where  FV(r)  is  defined  as  the  future  value  of  the  investment  using  a  minimum  attractive  rate 
of  return  (MARR)  of  r%.  The  future  value  method  is  equivalent  to  the  present  value 
method  and  the  annual  value  method. 
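
For the crisp (non-fuzzy) case, the future value relationship can be illustrated with a short Python sketch. The function name and the sample cash flows are hypothetical; the cash flow at index t is taken to occur at the end of period t, with n the last index.

```python
def future_value(cash_flows, r):
    """FV(r) = sum over t of P_t (1 + r)^(n - t), with n = len(cash_flows) - 1."""
    n = len(cash_flows) - 1
    return sum(p * (1.0 + r) ** (n - t) for t, p in enumerate(cash_flows))

# Hypothetical example: invest 100 now, receive 121 after two periods, MARR 10%.
fv = future_value([-100.0, 0.0, 121.0], 0.10)
```

In this example the investment exactly earns the minimum attractive rate of return, so the future value is zero up to rounding.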

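Chiu and Park's [3] formulation for the fuzzy future value has the same logic as the fuzzy
present value formulation:

$$\widetilde{FV} = \Big\{ \sum_{t=0}^{n-1} \Big[ \max(P_t^{l(y)}, 0) \prod_{t'=t+1}^{n} (1 + r_{t'}^{l(y)}) + \min(P_t^{l(y)}, 0) \prod_{t'=t+1}^{n} (1 + r_{t'}^{r(y)}) \Big] + P_n^{l(y)},$$

$$\qquad \sum_{t=0}^{n-1} \Big[ \max(P_t^{r(y)}, 0) \prod_{t'=t+1}^{n} (1 + r_{t'}^{r(y)}) + \min(P_t^{r(y)}, 0) \prod_{t'=t+1}^{n} (1 + r_{t'}^{l(y)}) \Big] + P_n^{r(y)} \Big\}$$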

Buckley's [2] membership function $\mu(x \mid \tilde F)$ is determined by

For the uniform cash flow series, $\mu(x \mid \tilde F)$ is determined by


(5)


(6)


where $i = 1, 2$, $\beta(n, r) = ((1 + r)^n - 1)/r$, $\tilde A \succ 0$, and $\tilde r \succ 0$.

3  Fuzzy  Equivalent  Uniform  Annual  Value  (EUAV)  Method 

The  EUAV  means  that  all  incomes  and  disbursements  (irregular  and  uniform)  must  be 
converted  into  an  equivalent  uniform  annual  amount,  which  is  the  same  each  period.  The 
major  advantage  of  this  method  over  all  the  other  methods  is  that  it  does  not  require  making 
the  comparison  over  the  least  common  multiple  of  years  when  the  alternatives  have  different 
lives  [5].  The  general  equation  for  this  method  is 
$$EUAV = A = NPV\,\gamma^{-1}(n, r) = NPV \left[ \frac{(1+r)^n\, r}{(1+r)^n - 1} \right] \qquad (8)$$


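where NPV is the net present value. In the case of fuzziness, $\widetilde{NPV}$ will be calculated and
then the fuzzy EUAV ($\tilde A_n$) will be found. The membership function $\mu(x \mid \tilde A_n)$ for $\tilde A_n$ is
determined by

$$f_{ni}(y \mid \tilde A_n) = f_i(y \mid \widetilde{NPV})\, \gamma^{-1}(n, f_i(y \mid \tilde r)) \qquad (9)$$

and TFN(y) for fuzzy EUAV is

$$\tilde A_n(y) = \left( \frac{NPV^{l(y)}}{\gamma(n, r^{l(y)})}, \frac{NPV^{r(y)}}{\gamma(n, r^{r(y)})} \right) \qquad (10)$$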

Example 

Assume that NPV = (-$3,525.57, -$24.47, +$3,786.34) and r = (3%, 5%, 7%). Calculate the
fuzzy EUAV.

$$f_{6,1}(y \mid \tilde A_6) = (3{,}501.1y - 3{,}525.57) \left[ \frac{(1.03 + 0.02y)^6 (0.02y + 0.03)}{(1.03 + 0.02y)^6 - 1} \right]$$

$$f_{6,2}(y \mid \tilde A_6) = (-3{,}810.81y + 3{,}786.34) \left[ \frac{(1.07 - 0.02y)^6 (0.07 - 0.02y)}{(1.07 - 0.02y)^6 - 1} \right]$$

For y = 0, f_{6,1}(y | A_6) = -$650.96
For y = 1, f_{6,1}(y | A_6) = f_{6,2}(y | A_6) = -$4.82
For y = 0, f_{6,2}(y | A_6) = +$795.13
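
The example can be checked numerically with the sketch below. It follows the pairing used in the example (left side of NPV with left side of r, right with right) and takes n = 6 from the exponent appearing there; the function names are hypothetical. Direct evaluation gives end values within about one dollar of those quoted, which may reflect rounded factors in the original calculation.

```python
def capital_recovery(n, r):
    """gamma^{-1}(n, r) = r (1 + r)^n / ((1 + r)^n - 1)."""
    g = (1.0 + r) ** n
    return r * g / (g - 1.0)

def tfn_sides(tfn, y):
    """Left and right values of a triangular fuzzy number (a, b, c) at level y."""
    a, b, c = tfn
    return a + (b - a) * y, c - (c - b) * y

def fuzzy_euav(npv, rate, n, y):
    """Pair NPV's left side with r's left side, and right with right."""
    npv_l, npv_r = tfn_sides(npv, y)
    r_l, r_r = tfn_sides(rate, y)
    return npv_l * capital_recovery(n, r_l), npv_r * capital_recovery(n, r_r)

NPV = (-3525.57, -24.47, 3786.34)
RATE = (0.03, 0.05, 0.07)
left0, right0 = fuzzy_euav(NPV, RATE, 6, 0.0)   # end points at y = 0
left1, right1 = fuzzy_euav(NPV, RATE, 6, 1.0)   # both sides meet at y = 1
```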

It is now necessary to use a ranking method to rank the triangular fuzzy numbers, such as
Chiu and Park's [3] method, Chang's [6] method, Dubois and Prade's [7] method, Jain's [8] method,
Kaufmann and Gupta's [9] method, or Yager's [10] method. These methods may give different
ranking results, and most of them are tedious in graphic manipulation and require complex
mathematical calculation. In the following, two of the methods which do not require
graphical representations are given. Chiu and Park's [3] weighted method for ranking TFNs
with parameters (a, b, c) is formulated as

$$((a + b + c)/3) + wb \qquad (11)$$

where  w  is  a  value  determined  by  the  nature  and  the  magnitude  of  the  most  promising  value. 
The  preference  of  projects  is  determined  by  the  magnitude  of  this  sum. 

Kaufmann and Gupta [9] suggest three criteria for ranking TFNs with parameters (a, b, c).
The dominance sequence is determined according to the priority of:

1. Comparing the ordinary number (a+2b+c)/4.

2. Comparing the mode (the corresponding most promising value), b, of each TFN.

3. Comparing the range, c-a, of each TFN.

The  preference  of  projects  is  determined  by  the  amount  of  their  ordinary  numbers.  The 
project  with  the  larger  ordinary  number  is  preferred.  If  the  ordinary  numbers  are  equal,  the 
project  with  the  larger  corresponding  most  promising  value  is  preferred.  If  projects  have  the 
same  ordinary  number  and  most  promising  value,  the  project  with  the  larger  range  is  preferred. 
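
Both ranking rules are easy to express in code. The sketch below is illustrative only: the project data and the weight w = 0.3 are hypothetical, and the weighted term w*b reflects how Chiu and Park's formula is read here.

```python
def chiu_park_score(tfn, w):
    """Chiu and Park's weighted score (a + b + c) / 3 + w * b; larger is preferred."""
    a, b, c = tfn
    return (a + b + c) / 3.0 + w * b

def kaufmann_gupta_key(tfn):
    """Lexicographic key: ordinary number, then mode, then range."""
    a, b, c = tfn
    return ((a + 2.0 * b + c) / 4.0, b, c - a)

# Hypothetical projects described by TFNs (a, b, c).
projects = {"A": (2.0, 4.0, 6.0), "B": (1.0, 4.0, 7.0), "C": (1.0, 3.0, 5.0)}
best_kg = max(projects, key=lambda k: kaufmann_gupta_key(projects[k]))
scores_cp = {k: chiu_park_score(v, w=0.3) for k, v in projects.items()}
```

In this example A and B share the same ordinary number and mode, so the third criterion (range) decides in favour of B.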

4  Conclusions 

In  this  paper,  capital  budgeting  techniques  in  the  case  of  fuzziness  and  discrete 
compounding  have  been  studied.  Fuzzy  set  theory  is  a  powerful  tool  in  the  area  of 
management  when  sufficient  objective  data  has  not  been  obtained.  Appropriate  fuzzy 
numbers can capture the vagueness of knowledge. Other financial subjects such as
replacement analysis, income tax considerations, and continuous compounding can
also be treated in the case of fuzziness [11], [12]. Comparing projects with unequal lives has not been
considered in this paper. This will be an area for further study.

References 

[1] Zadeh, L. A., Fuzzy Sets, Information and Control, Vol. 8, pp. 338-353, (1965).

[2] Buckley, J. J., The Fuzzy Mathematics of Finance, Fuzzy Sets and Systems, Vol. 21,
pp. 257-273, (1987).

[3] Chiu, C. Y., Park, C. S., Fuzzy Cash Flow Analysis Using Present Worth Criterion, The
Engineering Economist, Vol. 39, No. 2, pp. 113-138, (1994).

[4] Ward, T. L., Discounted Fuzzy Cash Flow Analysis, in 1985 Fall Industrial Engineering
Conference Proceedings, pp. 476-481, December (1985).

[5] Blank, L. T., Tarquin, J. A., Engineering Economy, Third Edition, McGraw-Hill, Inc.,
(1978).

[6] Chang, W., Ranking of Fuzzy Utilities with Triangular Membership Functions, Proc.
Int. Conf. of Policy Anal. and Inf. Systems, pp. 263-272, (1981).

[7] Dubois, D., Prade, H., Ranking Fuzzy Numbers in the Setting of Possibility Theory,
Information Sciences, Vol. 30, pp. 183-224, (1983).

[8] Jain, R., Decision Making in the Presence of Fuzzy Variables, IEEE Trans. on Systems
Man Cybernet, Vol. 6, pp. 698-703, (1976).

[9] Kaufmann, A., Gupta, M. M., Fuzzy Mathematical Models in Engineering and
Management Science, Elsevier Science Publishers B. V., (1988).

[10] Yager, R. R., On Choosing Between Fuzzy Subsets, Kybernetes, Vol. 9, pp. 151-154,
(1980).

[11] Kahraman, C., Ulukan, Z., Continuous Compounding in Capital Budgeting Using
Fuzzy Concept, in the Proceedings of 6th IEEE International Conference on Fuzzy
Systems (FUZZ-IEEE'97), Bellaterra-Spain, July 1-5, (1997).

[12] Kahraman, C., Ulukan, Z., Fuzzy Cash Flows Under Inflation, in the Proceedings of
Seventh International Fuzzy Systems Association World Congress (IFSA'97),
University of Economics, Prague, Czech Republic, Vol. IV, pp. 104-108, June 25-29,
(1997).

[13] Zimmermann, H.-J., Fuzzy Set Theory and Its Applications, Kluwer Academic
Publishers, (1994).


Semi  Linear  Equation  with  Fuzzy  Parameters 


Said  Melliani 

Departement  of  Applied  Mathematics  and  Informatic 
Faculty  of  Sciences  and  Technics,  B.P  523  Beni  Mellal  Morocco 
E-mail  melliani@fstbm.ac.ma 


Abstract. In this paper we introduce the concept of fuzzy solution for
a semi linear equation with fuzzy parameters. The extension principle
described by L. A. Zadeh [5] provides a natural way for obtaining the
notion of fuzzy solution. The fuzzy extension of the solution operator is
shown to provide the unique solution in the format case.


1  Introduction 

Fuzzy set theory is a powerful tool for modelling uncertainty and for processing
vague or subjective information in mathematical models. While its main directions
of development have been information theory, data analysis, artificial
intelligence, decision theory, control, and image processing (see e.g. [1], [6], [7]),
fuzzy set theory is increasingly used as a means for modelling and evaluating the
influence of imprecisely known parameters in mathematical, technical, and physical
models. The purpose of this paper is to work out this approach when the models
are constituted by partial differential equations.

Based on the fuzzy description of parameters and mathematical objects, we shall
be concerned here with a partial differential equation in the scalar case of the form

$$u_t + \lambda u_x = a u$$
$$u(x, 0) = u_0(x)$$

Here the parameters $a$ and $\lambda$ will be fuzzy numbers. The solution $u(x,t)$ at any
fixed point $(x, t)$ will be a fuzzy number as well.

2  Partial  differential  equations 

Consider a semi-linear equation:

$$u_t + \lambda u_x = a u \qquad (1)$$

for a function $u = u(x,t)$, where $\lambda = \text{const.} > 0$. Along a line of the family

$$x - \lambda t = \xi = \text{const.}$$

("characteristic line" in the $xt$-plane) we have for a solution $u$ of (1)

$$\frac{du}{dt} = \frac{d}{dt} u(\lambda t + \xi, t) = \lambda u_x + u_t = a u \qquad (2)$$

Hence $u e^{-at}$ is constant along such a line, and depends only on the parameter $\xi$ which
distinguishes different lines. The general solution of (1) has the form

$$u(x, t) = u_0(\xi) \exp(at) = u_0(x - \lambda t) \exp(at) \qquad (3)$$

Formula (3) represents the general solution $u$ uniquely in terms of its initial
values

$$u(x, 0) = u_0(x)$$

Conversely, every $u$ of the form (3) is a solution of (1) with initial values $u_0$
provided $u_0$ is of class $C^1(\mathbb{R})$. We notice that the value of $u$ at any point $(x, t)$
depends only on the initial value $u_0$ at the single argument $\xi = x - \lambda t$, the
abscissa of the point of intersection of the characteristic line through $(x, t)$ with
the initial line, the $x$-axis. We consider $\Omega = \mathbb{R} \times \mathbb{R}_+$,
introducing the equation operator

$$E : C^1(\Omega) \to C(\Omega) : v \mapsto [(x, t) \mapsto v_t + \lambda v_x - a v],$$

the restriction operator, $\forall (x, t) \in \Omega$,

$$R_{x,t} : C^1(\Omega) \to \mathbb{R} : v \mapsto v(x, t) = u_0(x - \lambda t) e^{at}$$

and the solution operator

$$L : \mathbb{R}^2 \to C^1(\Omega) : (\lambda, a) \mapsto L(\lambda, a).$$
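
As a sanity check of formula (3), the crisp solution operator can be written down and tested numerically. The Python sketch below is an illustration only: the function names, the initial profile u0 = sin, and the parameter values are arbitrary assumptions, and the residual of (1) is estimated with central differences.

```python
import math

def solve(u0, lam, a):
    """Solution operator L(lam, a): returns u(x, t) = u0(x - lam t) exp(a t)."""
    return lambda x, t: u0(x - lam * t) * math.exp(a * t)

def residual(u, lam, a, x, t, h=1e-5):
    """Central-difference estimate of u_t + lam u_x - a u at (x, t)."""
    u_t = (u(x, t + h) - u(x, t - h)) / (2 * h)
    u_x = (u(x + h, t) - u(x - h, t)) / (2 * h)
    return u_t + lam * u_x - a * u(x, t)

u = solve(math.sin, lam=2.0, a=0.5)   # u0 = sin is an arbitrary smooth choice
```

The residual at any sample point should vanish up to discretization error, and `u(x, 0.0)` reproduces the initial profile.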

3  Fuzzy  sets  and  fuzzy  numbers 

Given a set $X$, a fuzzy set $A$ over $X$ is a map

$$m_A : X \to [0, 1]$$

called the membership function of $A$ (it is convenient to distinguish between $A$ and its
membership function to be able to employ the usual language of set theory).
Thus, given $x \in X$, $m_A(x)$ is considered the degree to which, respectively
the possibility that, $x$ belongs to $A$ (in classical set theory, $m_A$ would correspond
to the characteristic function of $A$). This concept allows one to model uncertainty in
situations where more information than just upper and lower bounds is available
(in contrast to interval analysis), but no probability distributions are available.
This situation often arises, e.g., in engineering practice, when parameters are
estimated partially in a subjective way.

We denote the family of fuzzy sets over $X$ by $F(X)$. The $\alpha$-level sets are the
classical sets

$$A^\alpha = \{x \in X : m_A(x) \ge \alpha\}$$
A fuzzy real number is an element $A \in F(\mathbb{R})$ such that all level sets $A^\alpha$ are
compact intervals ($0 < \alpha \le 1$) and $A^1$ is not empty. The graph of $m_A$ has a
monotonically increasing left branch, a central point or plateau of membership
degree one, and a monotonically decreasing right branch. Similarly, one can define
fuzzy vectors, fuzzy functions, etc. The extension principle introduced by [5] allows
the evaluation of functions on, e.g., fuzzy numbers according to the following
definition: Let

$$f : \mathbb{R}^n \to \mathbb{R}$$

be a function, and define the extension [1], [3]

$$f : (F(\mathbb{R}))^n \to F(\mathbb{R})$$

by

$$m_{f(a_1, \ldots, a_n)}(y) = \sup_{y = f(x_1, \ldots, x_n)} \inf\,(m_{a_1}(x_1), \ldots, m_{a_n}(x_n))$$

It can be shown that in case $f$ is continuous, $f(a_1, \ldots, a_n)$ is a fuzzy number as
well, and

$$f(a_1, \ldots, a_n)^\alpha = f(a_1^\alpha, \ldots, a_n^\alpha),$$

the set theoretic image of the level sets. Thus the lower and upper endpoints of
the interval $f(a_1, \ldots, a_n)^\alpha$ can be obtained by minimizing / maximizing $f$ over

$$a_1^\alpha \times \ldots \times a_n^\alpha.$$
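
The level-set characterization suggests a simple way to approximate the extension numerically: take the level sets of the arguments and minimize / maximize f over their product. The Python sketch below does this by grid search. It is an approximation under stated assumptions: the fuzzy numbers are taken to be triangular (a, b, c), the function names are hypothetical, and a grid search only brackets the true extrema.

```python
import itertools

def alpha_cut(tfn, alpha):
    """Level set [lo, hi] of a triangular fuzzy number (a, b, c), 0 <= alpha <= 1."""
    a, b, c = tfn
    return a + alpha * (b - a), c - alpha * (c - b)

def extend(f, fuzzy_args, alpha, steps=50):
    """Approximate the alpha-level set of f(a_1, ..., a_n) by a grid search."""
    grids = []
    for tfn in fuzzy_args:
        lo, hi = alpha_cut(tfn, alpha)
        grids.append([lo + (hi - lo) * k / steps for k in range(steps + 1)])
    values = [f(*point) for point in itertools.product(*grids)]
    return min(values), max(values)
```

For instance, `extend(lambda x, y: x + y, [(1.0, 2.0, 3.0), (2.0, 3.0, 4.0)], 0.5)` returns the interval obtained by adding the two level sets [1.5, 2.5] and [2.5, 3.5].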

We denote by 0 the crisp zero function in $F(\mathbb{R}^n)$, that is,

$$m_0(f) = \begin{cases} 1 & \text{if } f = 0 \\ 0 & \text{otherwise} \end{cases}$$

Definition 1. Let $A$ be a fuzzy set.

- $A$ is normalized if there exists an element $x$ in $A$ such that $m_A(x) = 1$.

- The $\alpha$-level sets $A^\alpha$ for ($0 < \alpha \le 1$) of a fuzzy set $A$ are the classical (crisp)
sets.

- $A$ is convex if and only if its $\alpha$-level sets are convex.

- A fuzzy number is a convex, normalized fuzzy subset of the domain $A$.


The concept of fuzzy number is an extension of the notion of real number: it
encodes approximate but non-probabilistic quantitative knowledge [2].


4  Fuzzy  semi-linear  equation 

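Let us consider a fuzzy semi-linear equation

$$\begin{cases} u_t + \tilde\lambda u_x = \tilde a u \\ u(x, 0) = u_0(x) \end{cases} \qquad (4)$$

where $\tilde\lambda$ and $\tilde a$ are two fuzzy numbers and the initial condition $u_0$ is a classical function
in $C(\mathbb{R})$.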
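By the extension principle,

$$E : F(C^1(\Omega)) \to F(C(\Omega))$$

$$R_{x,t} : F(C^1(\Omega)) \to F(\mathbb{R})$$

$$L : F(\mathbb{R}^2) \to F(C^1(\Omega))$$

and, $\forall (x, t) \in \mathbb{R} \times \mathbb{R}_+$,

$$L_{x,t} : F(\mathbb{R}^2) \to F(\mathbb{R}), \qquad L_{x,t}(\tilde\lambda, \tilde a) = u(x, t), \text{ the solution of (4).}$$

The fuzzy value $L_{x,t}(\tilde\lambda, \tilde a)$ may be computed by the extension principle in this
way:

$$m_{L_{x,t}(\tilde\lambda, \tilde a)}(y) = \sup\{\min(m_{\tilde\lambda}(\lambda), m_{\tilde a}(a)) : y = L_{x,t}(\lambda, a)\}$$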

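Lemma 2. We have

$$R_{x,t} \circ L = L_{x,t} \quad \text{on } F(\mathbb{R}^2).$$

Proof (of lemma).

$$m_{L_{x,t}(\tilde\lambda, \tilde a)}(y) = \sup_{\substack{(\lambda, a) \in \mathbb{R}^2 \\ y = L_{x,t}(\lambda, a)}} \min(m_{\tilde\lambda}(\lambda), m_{\tilde a}(a))$$

$$m_{L(\tilde\lambda, \tilde a)}(f) = \sup_{\substack{(\lambda, a) \in \mathbb{R}^2 \\ f(x,t) = L_{x,t}(\lambda, a)}} \min(m_{\tilde\lambda}(\lambda), m_{\tilde a}(a))$$

$$m_{R_{x,t} \circ L(\tilde\lambda, \tilde a)}(y) = \sup_{\substack{(\lambda, a) \in \mathbb{R}^2 \\ y = R_{x,t} \circ L(\lambda, a)}} \min(m_{\tilde\lambda}(\lambda), m_{\tilde a}(a))$$

$$= \sup_{\substack{(\lambda, a) \in \mathbb{R}^2 \\ y = u_0(x - \lambda t) e^{at}}} \min(m_{\tilde\lambda}(\lambda), m_{\tilde a}(a))$$

$$= \sup_{\substack{(\lambda, a) \in \mathbb{R}^2 \\ y = L_{x,t}(\lambda, a)}} \min(m_{\tilde\lambda}(\lambda), m_{\tilde a}(a))$$

$$= m_{L_{x,t}(\tilde\lambda, \tilde a)}(y)$$

□

We have

$$m_{L(\tilde\lambda, \tilde a)}(f) = \sup\{\inf(m_{\tilde\lambda}(\lambda), m_{\tilde a}(a)) : f = L(\lambda, a)\}$$

and

$$m_{L_{x,t}(\tilde\lambda, \tilde a)}(y) = \sup\{m_{L(\tilde\lambda, \tilde a)}(f) : f \in C^1(\Omega) \text{ with } y = f(x, t)\}$$

$$= \sup_{y = L_{x,t}(\lambda, a)} \min(m_{\tilde\lambda}(\lambda), m_{\tilde a}(a))$$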
Definition 3. An element $u \in F(C^1(\Omega))$ is called a fuzzy solution with the
initial data $u_0 \in C(\mathbb{R})$, if $E(u) = 0$ in $F(C(\Omega))$ and $R_{x,t}(u) = u_0(x - \tilde\lambda t)\, e^{\tilde a t}$ in
$F(\mathbb{R})$.

Proposition 4. Given $\tilde\lambda$, $\tilde a$ in $F(\mathbb{R})$, $u = L(\tilde\lambda, \tilde a)$ is a fuzzy solution of (4).

Proof (of proposition). To show that $u = L(\tilde\lambda, \tilde a)$ solves the fuzzy partial
differential equation, we compute:

$$m_{E(L(\tilde\lambda, \tilde a))}(w) = \sup\{m_{L(\tilde\lambda, \tilde a)}(v) : w = E(v)\}$$

$$= \sup\{\sup\{\inf(m_{\tilde\lambda}(\lambda), m_{\tilde a}(a)) : v = L(\lambda, a)\} : w = E(v)\}$$

If $w \ne 0$ and $w = E(v)$, then $\{(\lambda, a) \in \mathbb{R}^2 : v = L(\lambda, a)\} = \emptyset$, so the inner
supremum is zero and $m_{E(u)}(w) = 0$.

If $w = 0$, we may take $(\lambda, a) \in \mathbb{R}^2$ with $m_{\tilde\lambda}(\lambda) = m_{\tilde a}(a) = 1$ and $v = L(\lambda, a)$.
Then $E(v) = 0$, and so the supremum equals 1.

Let $S = \{u \in C^1(\Omega) : E(u) = 0\}$. We can view $F(S)$ as a subset of $F(C^1(\Omega))$,
setting the membership degree of any $u \in C^1(\Omega) \setminus S$ to some $\tilde u \in F(S)$ equal to
zero. □

Lemma 5. If $u \in F(C^1(\Omega))$ is a solution to (4), then $u$ belongs to $F(S)$.

Proof (of lemma). We have that

$$m_{E(u)}(v) = \sup\{m_u(w) : v = E(w)\}$$

Suppose there exists $v \notin S$ such that $m_u(v) > 0$. Putting $w = E(v)$ we have
$m_{E(u)}(w) \ge m_u(v) > 0$, contradicting the hypothesis that $E(u) = 0$. □

Proposition 6. The fuzzy solution $u \in F(C^1(\Omega))$ to (4) is unique.

Proof (of proposition). Since $L : \mathbb{R}^2 \to S$ is bijective, the same is true of the
extension $L : F(\mathbb{R}^2) \to F(S)$. If $u \in F(C^1(\Omega))$ is a solution, it belongs to
$F(S)$ by the lemma and hence is uniquely determined by the initial data. □
Abstract. Based on axiomatic rough set theory, first order rough logic
was developed earlier. In this paper, a new model theory for that logic
is introduced. With this new semantics, first order rough logic is shown
to be equivalent to first order S5, and hence consistent and complete.


1  Introduction 

One of the most important studies of rough set theory is the study of the lower
and upper approximations of equivalence relations. Many interesting properties
have been reported [4]. In 1993, we presented an axiomatic approach to such
a study; namely, we showed that two abstract operators which are characterized
by certain axioms are the lower and upper approximations of an equivalence
relation [1]. By translating these axioms into logic terms, we constructed first
order rough logic. The syntax is similar to the modal logic S5; its semantics,
however, is different [2]. We showed the model is consistent, but we had not
proved its completeness. In this paper, we revisit the model and propose a new
semantics. With this new semantics, we show that first order rough logic is
equivalent to first order S5, and hence it is consistent and complete.


2  Possible  Worlds  Semantics  -  An  Informal  Overview 

In this section we shall describe the relationships between rough logic and rough
set theory. Rough set theory is based on a known equivalence relation (indiscernibility
relation). However, in applications, the equivalence relations are often
unknown, so the proposed rough logic is based solely on the notions of "lower"
and "upper" approximations without using an explicit equivalence relation. However,
in order to see clearly the relationships between rough logic and rough set
theories, we explain the idea using explicit equivalence relations. In subsequent
expositions, the use of equivalence relations is avoided.
2.1 Observable Worlds

Let E be the universe of discourse and P an equivalence relation. Note that
we should view P as the one induced by the axioms of two abstract operators.
By abuse of notation, we will use P to denote the corresponding partition.
The collection of all equivalence classes is called the quotient space, denoted by Q.
Based on the available information, all elements in the same equivalence class
are indistinguishable. To each observer, an equivalence class is a multi-set (bag)
that consists of multiple copies of one element. We should, however, also note
that different observers may see multiple copies of different elements (but in
the same equivalence class). This led us to define an observable world as the
collection of one representative from each equivalence class. Different observers
have different collections of representatives. We should also point out that not
all mathematical combinations of representatives will be observed by someone.
In other words, the collection of all observable worlds is a subset of all possible
combinations.


The  intent  of  rough  logic  is 

1.  to  describe  E  as  much  as  we  can,  using 

2.  only  the  available  imperfect  observations  (observable  worlds). 

Example. The universe $E = \{1, 2, 3, 4, 5, 6, 7, 8, 9\}$ has a partition:

$$H_1 = \{3, 6, 9\}, \quad H_2 = \{2, 5, 8\}, \quad H_3 = \{1, 4, 7\}.$$

Then, the quotient space is $Q = \{H_1, H_2, H_3\}$.

Observers may see $E$ as, for example,

$$W^1 = \{1, 1, 1, 2, 2, 2, 3, 3, 3\}, \quad W^2 = \{1, 1, 1, 2, 2, 2, 6, 6, 6\}$$

If we use set notations, they are

$$W^1 = \{1, 2, 3\}, \quad W^2 = \{1, 2, 6\}.$$

Other possible observable worlds are

$$\{1, 2, 9\}, \ \{1, 5, 3\}, \ \ldots, \ \{7, 8, 6\}, \ \{7, 8, 9\}.$$

Each observable world $W^h$ is a set of representatives and is equivalent to the quotient space $Q$.
The relational structure on each $W^h$ is induced from $E$ by restriction; see the next
subsection.
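
The candidate observable worlds of this example can be enumerated mechanically. The Python sketch below (function name hypothetical) lists every choice of one representative per equivalence class; as noted in Section 2.1, only a subset of these combinations need actually be observed.

```python
import itertools

PARTITION = [{3, 6, 9}, {2, 5, 8}, {1, 4, 7}]   # H1, H2, H3 from the example

def candidate_worlds(partition):
    """All ways of choosing one representative from each equivalence class."""
    return [frozenset(choice)
            for choice in itertools.product(*map(sorted, partition))]

worlds = candidate_worlds(PARTITION)
```

With three classes of three elements each, this yields 3 x 3 x 3 candidate worlds, including {1, 2, 3} and {1, 2, 6}.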
2.2 Induced Structures on Observable Worlds

In this section, we will explain how observable worlds get their relational structures.
The material in this section is slightly different from the corresponding
structures given in [2]. The structure on relations is the same. The structure
of functions is slightly different: in the cited paper, the function values that are
out of "range" were treated as missing values; in this paper, we replace the
missing values by their equivalent ones. Each observable world $W^h$ is a subset of $E$, hence all
relations and functions on $E$ can be interpreted in $W^h$ by restricting their domain to
$W^h$. Intuitively, each $W^h$ represents one particular imperfect observation. These
induced functions may be distorted and relations may have missing values.

Functions 

Let $E = \{1, 2, 3, 4, 5, 6, 7, 8, 9\}$ be the universe of discourse. Let $f(-)$ be a function
defined by

$$f(x) = 9 - [x/2],$$

where $[z]$ represents the integral part of $z$. The function $f(-)$ induces a new
function on each observable world $W^h$; for example,

(1) In Observable World $W^1 = \{1, 2, 3\}$, the function values are:

$$f(1) = 9, \quad f(2) = 8, \quad f(3) = 8.$$

Since these values lie outside of $W^1$, we replace them by their equivalent values,
namely,

$$f(1) = 3, \quad f(2) = 2, \quad f(3) = 2.$$

Such a new function is the induced view of $f$ on $W^1$.

(2) In Observable World $W^2 = \{1, 2, 6\}$, the values of the same function will be replaced by
their equivalent values in $W^2$, i.e.,

$$f(1) = 6, \quad f(2) = 2, \quad f(6) = 6.$$

It is the induced view of $f$ in $W^2$.

So the same function $f$ is distorted into a different, yet equivalent, function on
each observable world. These distorted functions are equivalent in the sense that
each function induces the same function in the quotient space $Q$. Intuitively,
each distorted function represents an imperfect observation of the "true" function.
The goal of rough logic is to recapture some essential features of the function
$f$ via these distorted versions.
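
The replacement of out-of-world values by their equivalent representatives can be sketched as follows. This is an illustrative reading of the construction with hypothetical helper names; it reproduces the two induced views computed above.

```python
PARTITION = [{3, 6, 9}, {2, 5, 8}, {1, 4, 7}]   # H1, H2, H3

def f(x):
    """f(x) = 9 - [x / 2], with [z] the integral part of z."""
    return 9 - x // 2

def induced(func, world, partition):
    """Map each value of func to its equivalent representative in `world`."""
    def representative(value):
        block = next(b for b in partition if value in b)
        return next(e for e in world if e in block)
    return {x: representative(func(x)) for x in sorted(world)}

view_1 = induced(f, {1, 2, 3}, PARTITION)   # induced view on {1, 2, 3}
view_2 = induced(f, {1, 2, 6}, PARTITION)   # induced view on {1, 2, 6}
```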

Relations 

Let $R$ denote the collection of all relations in $E$. Let $r$ be an n-ary relation, and
$r^h$ its restriction to an observable world $W^h$; some values in $r$ may not appear in $r^h$. The
collection of these restricted relations will be denoted by $R^h$. We do require that $R^h$ is a
non-empty set; see the next subsection.
2.3 Impossible World

First we should note that not every mathematical $W^h$ would be an observable
world. In order for it to qualify as an observable world, at least one of these
n-ary relations $r^h$ must be a non-empty relation. Informally, an impossible world is
a world in which all predicates are evaluated to false in all instances. So we do
require

no observable world is an impossible world.

Let $P(x_1, x_2, \ldots)$ be an n-ary predicate and let the variables $x_i$ be assigned to $e_i \in E$,
$i = 1, 2, \ldots$. Then the predicate $P(e_1, e_2, \ldots)$ is evaluated to true at $W^h$
if the relation $r^h$ that interprets $P$ contains the tuple $(e_1, e_2, \ldots)$. A world in
which $R^h$ is an empty set is an impossible world.

3  Axiomatic  Rough  Set  Theory 

In the last few subsections, we explained the possible world semantics using
explicit equivalence relations. In applications, such explicit equivalence relations may not
be available. We recall here the axiomatic rough set theory, in which only upper
and lower approximate operators are available.

Pawlak introduced rough sets via equivalence relations. He derived many
interesting properties of upper and lower approximations. In [1], we showed that
Pawlak's lower and upper approximations can be characterized axiomatically by
the following "Six" Axioms: Let E be the universe of discourse, $X \subseteq E$, and
$C(X) = E \setminus X$. Let L and H be the lower rough and upper (higher) rough
operators.

(1) $H(\emptyset) = \emptyset$; (2) $L(X) \subseteq X$;

(3) $L(X) \subseteq L(L(X))$; (4) $H(X) \cup H(Y) = H(X \cup Y)$;

(5) $L(C(X)) = C(H(X))$; (6) $L(X) = H(L(X))$;

(6a) $H(X) = L(H(X))$.

These seven axioms are not minimal. Since this is not a mathematical paper,
we will not digress on it. The axioms consist of Kuratowski's axioms of
topological spaces and one additional axiom that declares that open sets are closed sets
and vice versa. Essentially, the seven axioms characterize clopen spaces. It is
easy to see that a clopen space induces a partition, and hence an equivalence
relation. The two abstract operators H and L turn out to be the upper and
lower approximations of the induced equivalence relation.
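
As a small illustration of the direction "partition implies axioms", the following Python sketch builds L and H from a partition and checks axioms (1)-(6a) by brute force on a six-element universe. The universe, the partition, and the function names (`lower`, `upper`, `axioms_hold`) are assumptions made for this example only; they do not come from the paper.

```python
from itertools import combinations

def lower(X, partition):
    """L(X): union of the blocks entirely contained in X."""
    return frozenset(e for b in partition if b <= X for e in b)

def upper(X, partition):
    """H(X): union of the blocks that intersect X."""
    return frozenset(e for b in partition if b & X for e in b)

def subsets(E):
    """All subsets of the finite set E, as frozensets."""
    items = sorted(E)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]

def axioms_hold(E, partition):
    """Brute-force check of axioms (1)-(6a) for the partition-based L and H."""
    L = lambda X: lower(X, partition)
    H = lambda X: upper(X, partition)
    C = lambda X: E - X
    if H(frozenset()) != frozenset():                      # (1)
        return False
    for X in subsets(E):
        ok = (L(X) <= X                                    # (2)
              and L(X) <= L(L(X))                          # (3)
              and L(C(X)) == C(H(X))                       # (5)
              and L(X) == H(L(X))                          # (6)
              and H(X) == L(H(X)))                         # (6a)
        if not ok:
            return False
        for Y in subsets(E):
            if H(X) | H(Y) != H(X | Y):                    # (4)
                return False
    return True

# Hypothetical universe and partition, chosen only for illustration.
E = frozenset(range(1, 7))
partition = [frozenset({1, 2}), frozenset({3}), frozenset({4, 5, 6})]
print(axioms_hold(E, partition))
```

The check is exhaustive, so it is only practical for very small universes; it is meant as a sanity check of the statements above, not as a proof.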

4 Rough and S5 Models

The language and axioms of rough and S5 logic are not specifically referenced
in this paper, so we refer readers to [2] for details.

In this section, we will show that the models of rough and S5 logic are
equivalent. Following [3], a frame for S5, or for short S5-frame, is:

$SM = (D, N, W, B, \gamma)$

where

1. D is a non-empty set, called the domain of SM,

2. N is a subset of D, called constants,

3. W is a non-empty set of possible worlds,

4. B is a binary relation of "accessibility" on W; for S5 logic, B can be the
trivial one, namely, all worlds are accessible to each other,

5. $\gamma$ is a function which assigns to each pair consisting of an n-ary predicate
symbol, n > 0, different from "=", and an element w of W, an n-ary relation
on D; to the symbol "=", if present in the language, the identity relation on
D; and to each n-ary function symbol a function from $D^n$ to D.

$\gamma$ is called an interpretation. Intuitively D is "equivalent" to the quotient set Q.


4.1 Rough Model

A Rough Model is a 7-tuple

$RM = (E, N, R, F, RO, \mathcal{W}, \gamma)$

where:

1. E is a set of entities $\{e, e_1, \ldots\}$;

2. N is a set of distinguished entities $\{n, n_1, \ldots\}$, called the domain of constants;
moreover $H(n_i) = n_i$, $i = 1, 2, \ldots$;

3. R is a set of non-empty relations $\{r, r_1, \ldots\}$, each of which is defined on E;

4. F is a set of functions $\{f, f_1, \ldots\}$, each of which is defined on E;

5. RO is a set of rough operators satisfying the six axioms, i.e., RO = {H};

6. $\gamma$ is a function which assigns to each n-ary predicate symbol, n > 0, different
from "=", an n-ary relation on E; to the symbol "=", if present in the
language, the identity relation on E; and to each n-ary function symbol a
function from $E^n$ to E;

7. $\mathcal{W}$ is a collection of observable worlds which are constructed from E and RO
as explained in Section 2,

$\mathcal{W}^h = (W^h, N^h, R^h, F^h)$,

where we require that $R^h$ is non-empty; note that this condition excludes the
impossible worlds.

Roughly, the new model is the same as the one in [2], except that we exclude the
"impossible worlds."

Two rough models

$RM = (E, N, R, F, RO, \mathcal{W}, \gamma)$

$RM' = (E', N', R', F', RO', \mathcal{W}', \gamma')$

are said to be equivalent if there is a map F between the two models such that F
induces an isomorphism between the two families of observable worlds,
or more formally, F induces an isomorphism between

$\mathcal{W}^h = (W^h, N^h, R^h, F^h)$ and $\mathcal{W}'^h = (W'^h, N'^h, R'^h, F'^h)$

in the sense that there are four families of one-to-one maps

$\forall h, \ W^h \to W'^h$

$\forall h, \ N^h \to N'^h$

$\forall h, \ R^h \to R'^h$

$\forall h, \ F^h \to F'^h$

Remark: It should be noted that we have not required the two models RM and RM'
to be isomorphic.

4.2  The  Equivalence  of  Two  Models 

Next we state the most important result of this paper.

Proposition. A rough model induces an S5-frame and vice versa.

Proof: We will prove this proposition in four steps. First we will show that
a rough model induces an S5-frame; then in the second step we construct a
rough model from a given S5-frame. In the third and fourth steps, we show that the
compositions of these two steps are identities.

(1) Step One: Assume we are given a rough model RM, namely,

$RM = (E, N, R, F, RO, \mathcal{W}, \gamma)$.

Note that the rough model has the following family $\mathcal{W}$ of observable worlds,

$\mathcal{W} = \{\mathcal{W}^h \mid h \in \text{an index set}\}$, where $\mathcal{W}^h = (W^h, N^h, R^h, F^h)$.

We will show that the family $\mathcal{W}$ determines an S5-frame:

First, note that all $W^h$'s are "isomorphic" to each other (see Section 2.1);
so are the $N^h$'s and the $F^h$'s respectively. We will identify them via the
respective isomorphisms:

$D_{S5} = W^h, \ \forall h$; $\quad N_{S5} = N^h, \ \forall h$; $\quad F_{S5} = F^h, \ \forall h$.

Let us write

$W_{S5} = \{NAME(\mathcal{W}^h) \mid \forall h\}$.

We will show that $D_{S5}$, $N_{S5}$, $F_{S5}$, $W_{S5}$, together with an equivalence relation $B_{S5}$ and
an interpretation $\gamma_{S5}$, both to be defined below, do form an S5-frame.

A) Construction of $B_{S5}$: First, we note that the isomorphism among the $W^h$'s, $\forall h$,
defines an equivalence relation on $W_{S5}$; it is a trivial one, i.e., there is only one
equivalence class. We will denote this equivalence relation by $B_{S5}$.

B) Definition of the induced interpretation $\gamma_{S5}$: For each h, take the isomorphic
copy, in the domain $D_{S5}$, of $R^h$ (in the domain $W^h$). The union of all those copies, over all h,
is denoted by R.

Then the induced interpretation can be defined as follows: In the rough model the
interpretation $\gamma$ assigns to a predicate symbol a relation $r \in R$. In turn, r induces
a relation $r^h$ on each $W^h$, $\forall h$; see Section 2. So the predicate is interpreted
in $W^h$, $\forall h$, via such a route. The interpretation of function symbols is the same;
see Section 2.2.

Thus we have an S5-frame,

$SM = (D, N, W, B, \gamma)$.

(2) Step Two: Conversely, given an S5-frame SM, we will construct a rough
model RM.

A) Construction of E, N and RO: First, we need to set up the notations
for symbols in SM. We write

$D = \{d^k \mid k \text{ is an index}\}$, $\quad W = \{w^h \mid h \text{ is an index}\}$.

Next, we consider $E' = D \times W$, and write

$E'^k = \{(d^k, w) \mid d^k \text{ is a fixed element in } D \text{ and } w \text{ varies through } W\}$.

The collection of the $E'^k$ forms a partition of $E'$; we call it the vertical partition and
each $E'^k$ a vertical equivalence class. Now, we will consider a quotient set E as
follows: First note that N is a subset of D. If $d^k = n^k$ is an element of N, then
we collapse the vertical equivalence class $E'^k$ to an element, denoted by $(n^k, 0)$
or simply $n^k$. This new set is denoted by E; the collapsing map is denoted
by $Q : E' \to E$. E inherits a partition from $E'$: if $d^k \in N$, the equivalence
class is a singleton; if $d^k \in (D \setminus N)$, then the equivalence class is the vertical
equivalence class. By abuse of terminology, we will refer to such a "collapsed"
vertical partition as the vertical partition; similarly, each equivalence class will still be
called a vertical equivalence class. Finally, we observe that the vertical partition
of E gives rise to the upper approximation operator H, which satisfies the six
axioms; see Section 3. This constitutes the component RO.

B) Construction of R: Let p be an n-ary predicate symbol, and let its interpretation
in the S5-frame SM be

$\gamma(p, w^h) = r^h$,

where $r^h$ is an n-ary relation in D, that is,

$r^h = \{(d_1, d_2, \ldots, d_n) \mid d_j \in D\} \subseteq D^n$.

We should stress that those $r^h$ are associated with $w^h$. We shall "embed" $r^h$
into a relation on E through the following consideration. First, we "embed" it
into $E'$:

(*) $r^h \times w^h = \{((d_1, w^h), (d_2, w^h), \ldots, (d_n, w^h)) \mid (d_1, d_2, \ldots, d_n) \in r^h, \text{ for a fixed } h\}$

Then we apply the collapsing map Q. $Q(r^h \times w^h)$ is the induced relation on E.
Next, we consider the union (varying h),

$r = \bigcup_h Q(r^h \times w^h)$.

r is a subset of $E^n$, hence is a relation on E. This r is an interpretation of p in
E. Let the collection of all such r's be denoted by R; it is a required component of
RM.

C) Construction of F: F consists of all functions that are the Cartesian product
of a function $d : D^n \to D$ and the identity map on W.

D) Construction of $\mathcal{W}$: The observable worlds will be induced from E and its
vertical partition, as explained in Section 2.1.

Combining A), B), C) and D), we have constructed RM.

(3) Step Three: Now, we need to complete "the cycle." By Step Two, we have
constructed $RM_2$ from $SM_1$; please note the index. By Step One, $RM_2$ is
transformed back to $SM_3$; finally, we need to show that the two SM's are equivalent. Let X
be an SM that is transformed to Y by Step Two. In the model Y, E has a
vertical partition. To get an observable world, we select a representative from each
vertical equivalence class. The selection can be expressed by the composition of
a map $d \to (d, w)$ and Q. We write the composition, $d \to (d, w) \to w$,
as f, understanding that when $d = n \in N$, $w = 0$. We will say f is a constant
map if $f(d) = w_0$ is a constant on $D \setminus N$ (D minus N). From Step One, we know that
the world selected by f is an observable world if its set of relations is non-empty.
However, from equation (*) in Step Two, one observes that if f is not a constant map,
then this set of relations is empty, hence such worlds are not included in the family of
observable worlds. So the observable worlds are precisely the same as those given in SM.
This completes the proof of cycle one.

(4) Step Four: Let Y be an $RM_1$. By Step One, it is transformed to X, an
$SM_2$. By Step Two, X will be transformed back to $RM_3$;
we need to show that the two RM's are equivalent (not necessarily equal); see the
Remark in Section 4.1. In Step Three, we have shown that X is transformed by
Step Two, then Step One, back to X. Now we take Y and transform it to X by
Step One. By Step Two, we get Y'. Now note that the observable worlds of
Y' are precisely X, so Y and Y' are equivalent. QED.

Since S5 is complete with respect to S5-frames, we have

Theorem. Rough logic with this new interpretation is sound and complete.


5  Conclusion 

Rough set theory models uncertainty by equivalence relations (indiscernibility
relations). In real-world applications, such precise knowledge of equivalence
relations is often unavailable. However, one can often observe the approximations;
in other words, the knowledge of approximate operators is reasonably
available. Based on this belief, we developed the axiomatic rough set theory
and the first-order rough logic [2] without an explicit equivalence relation.

The proposed model, namely RM, is too rich in semantics for the syntax.
The language cannot completely determine the model. So in this paper, we
introduce an equivalence relation among those models; see Section 4.1. Then the
equivalence class contains the right amount of information to be characterized
by the syntax. In other words, RM is the "ideal world" that the syntax intends to
address. However, due to insufficient information, the syntax can only determine
an equivalence class of the "ideal" worlds; there is uncertainty. From this aspect,
rough logic is richer than S5.

Abstract. A generalized decision logic in interval-set-valued information
tables is introduced, which is an extension of decision logic studied
by Pawlak. Each object in an interval-set-valued information table takes
an interval set of values. Consequently, two types of satisfiabilities of
a formula are introduced. Truth values of formulas are defined to be
interval-valued, instead of single-valued. A semantics model of the proposed
logic language is studied.


1  Introduction 

The theory of rough sets is commonly developed and interpreted through the
use of information tables, in which a finite set of objects is described by a
finite number of attributes [10, 11]. A decision logic, called DL-language by
Pawlak [11], has been studied by many authors for reasoning about knowledge
represented by information tables [8, 11]. It is essentially formulated based on
the classical two-valued logic. The semantics of the DL-language is defined in
Tarski's style through the notions of a model and satisfiability in the context
of information tables. A strong assumption is made about information tables,
i.e., each object takes exactly one value with respect to an attribute. In some
situations, this assumption may be too restrictive to be applicable in practice.
Several proposals have been suggested using much weaker assumptions. More
specifically, the notion of set-based information tables (also known as incomplete
or nondeterministic information tables) has been introduced and studied, in
which an object can take a subset of values for each attribute [3, 14, 16, 20].
Based on the results from those studies, the main objective of this paper is to
introduce the notion of interval-set-valued information tables by incorporating
results from studies of interval-set algebra [17, 19]. A generalized decision logic
GDL is proposed, which is similar to modal logic, but has a different semantic
interpretation.

* This study is partially supported by the National Natural Science Foundation of
China and Natural Science Foundation of JiangXi Province, China.

This paper reports some of our preliminary results. In Section 2, we first
briefly review Pawlak's decision logic DL, and then introduce the notions of
α-degree truth and α-level truth. In Section 3, the notion of interval-set-valued
information tables is introduced. A generalized decision logic GDL is proposed
and interpreted based on two types of satisfiabilities. The concepts of
interval-degree truth and interval-level truth are proposed and studied. Inference rules
are discussed. In Section 4, we comment on two related studies.

2  A  Decision  Logic  in  Information  Tables 

The notion of an information table, studied by many authors [3, 10, 11, 16, 21],
is formally defined by a quadruple:

$S = (U, At, \{V_a \mid a \in At\}, \{I_a \mid a \in At\})$,

where

U is a finite nonempty set of objects,

At is a finite nonempty set of attributes,

$V_a$ is a nonempty set of values for $a \in At$,

$I_a : U \to V_a$ is an information function.

Each information function $I_a$ is a total function that maps an object of U to
exactly one value in $V_a$. Similar representation schemes can be found in many
fields, such as decision theory, pattern recognition, machine learning, data analysis,
data mining, and cluster analysis [11].

With an information table, a decision logic language (DL-language) can be
introduced [11]. In the DL-language, an atomic formula is given by (a, v), where
$a \in At$ and $v \in V_a$. If $\phi$ and $\psi$ are formulas in the DL-language, then so are
$\neg\phi$, $\phi \wedge \psi$, $\phi \vee \psi$, $\phi \to \psi$, and $\phi \equiv \psi$. The semantics of the DL-language can be
defined in Tarski's style through the notions of a model and satisfiability. The
model is an information table S, which provides interpretation for symbols and
formulas of the DL-language. The satisfiability of a formula $\phi$ by an object x,
written $x \models_S \phi$ or in short $x \models \phi$ if S is understood, is given by the following
conditions:

(a1). $x \models (a, v)$ iff $I_a(x) = v$,

(a2). $x \models \neg\phi$ iff not $x \models \phi$,

(a3). $x \models \phi \wedge \psi$ iff $x \models \phi$ and $x \models \psi$,

(a4). $x \models \phi \vee \psi$ iff $x \models \phi$ or $x \models \psi$,

(a5). $x \models \phi \to \psi$ iff $x \models \neg\phi \vee \psi$,

(a6). $x \models \phi \equiv \psi$ iff $x \models \phi \to \psi$ and $x \models \psi \to \phi$.

For a formula $\phi$, the set $m_S(\phi)$ defined by:

$m_S(\phi) = \{x \in U \mid x \models \phi\}$, (1)

is called the meaning of the formula $\phi$ in S. If S is understood, we simply write
$m(\phi)$. Obviously, the following properties hold [8, 11]:

(b1). $m(a, v) = \{x \in U \mid I_a(x) = v\}$,

(b2). $m(\neg\phi) = -m(\phi)$,

(b3). $m(\phi \wedge \psi) = m(\phi) \cap m(\psi)$,

(b4). $m(\phi \vee \psi) = m(\phi) \cup m(\psi)$,

(b5). $m(\phi \to \psi) = -m(\phi) \cup m(\psi)$,

(b6). $m(\phi \equiv \psi) = (m(\phi) \cap m(\psi)) \cup (-m(\phi) \cap -m(\psi))$.

The meaning of a formula $\phi$ is therefore the set of all objects having the property
expressed by the formula $\phi$. In other words, $\phi$ can be viewed as the description of
the set of objects $m(\phi)$. Thus, a connection between formulas of the DL-language
and subsets of U is established.
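
The satisfiability conditions (a1)-(a6) and the meaning set of equation (1) can be sketched in Python as follows. The encoding of formulas as nested tuples, the function names `satisfies` and `meaning`, and the four-object table are assumptions made for this illustration; they are not part of the DL-language itself.

```python
def satisfies(x, phi, table):
    """Decide x |= phi following conditions (a1)-(a6).

    Formulas are nested tuples (an encoding assumed for this sketch):
    ("atom", a, v), ("not", p), ("and", p, q), ("or", p, q),
    ("imp", p, q), ("iff", p, q).
    """
    op = phi[0]
    if op == "atom":                       # (a1)
        _, a, v = phi
        return table[x][a] == v
    if op == "not":                        # (a2)
        return not satisfies(x, phi[1], table)
    if op == "and":                        # (a3)
        return satisfies(x, phi[1], table) and satisfies(x, phi[2], table)
    if op == "or":                         # (a4)
        return satisfies(x, phi[1], table) or satisfies(x, phi[2], table)
    if op == "imp":                        # (a5): rewritten as (not p) or q
        return satisfies(x, ("or", ("not", phi[1]), phi[2]), table)
    if op == "iff":                        # (a6)
        return (satisfies(x, ("imp", phi[1], phi[2]), table)
                and satisfies(x, ("imp", phi[2], phi[1]), table))
    raise ValueError(f"unknown connective: {op!r}")

def meaning(phi, table):
    """m(phi): the set of objects satisfying phi, as in equation (1)."""
    return {x for x in table if satisfies(x, phi, table)}

# Hypothetical information table: each object takes exactly one value per attribute.
table = {
    "o1": {"color": "red", "size": "small"},
    "o2": {"color": "red", "size": "large"},
    "o3": {"color": "blue", "size": "small"},
    "o4": {"color": "blue", "size": "large"},
}
red = ("atom", "color", "red")
small = ("atom", "size", "small")
print(meaning(("and", red, small), table))
print(meaning(("imp", red, small), table))
```

On this table the two printed sets agree with (b3) and (b5): the conjunction is the intersection of the two meaning sets, and the implication is the complement of m(red) united with m(small).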

A formula $\phi$ is said to be true in an information table S, written $\models_S \phi$ or $\models \phi$
for short when S is clear from the context, if and only if $m(\phi) = U$. That is, $\phi$ is
satisfied by all objects in the universe. Two formulas $\phi$ and $\psi$ are equivalent in
S if and only if $m(\phi) = m(\psi)$. By definition, the following properties hold [11]:

(c1). $\models \phi$ iff $m(\phi) = U$,

(c2). $\models \neg\phi$ iff $m(\phi) = \emptyset$,

(c3). $\models \phi \to \psi$ iff $m(\phi) \subseteq m(\psi)$,

(c4). $\models \phi \equiv \psi$ iff $m(\phi) = m(\psi)$.

Thus, we can study the relationships between concepts described by formulas of
the DL-language based on the relationships between their corresponding sets of
objects.

The previous interpretation of the DL-language is essentially based on classical
two-valued logic. One may generalize it to a many-valued logic by introducing
the notion of degrees of truth [4, 5]. For a formula $\phi$, its truth value is defined
by [4, 5]:

$v(\phi) = \frac{|m(\phi)|}{|U|}$, (2)

where $|\cdot|$ denotes the cardinality of a set. This definition of truth value is
probabilistic in nature. Thus, the generalized logic is in fact a probabilistic
logic [7]. When $v(\phi) = \alpha \in [0, 1]$, we say that the formula $\phi$ is α-degree true. By
definition, we immediately have the properties:

(d1). $\models \phi$ iff $v(\phi) = 1$,

(d2). $\models \neg\phi$ iff $v(\phi) = 0$,

(d3). $v(\neg\phi) = 1 - v(\phi)$,

(d4). $v(\phi \wedge \psi) \le \min(v(\phi), v(\psi))$,

(d5). $v(\phi \vee \psi) \ge \max(v(\phi), v(\psi))$,

(d6). $v(\phi \vee \psi) = v(\phi) + v(\psi) - v(\phi \wedge \psi)$.

Properties (d3)-(d6) follow from the probabilistic interpretation of truth value.
Similar to the definitions of α-cuts in the theory of fuzzy sets [2], we define
α-level truth. For $\alpha \in [0, 1]$, a formula $\phi$ is said to be α-level true, written $\models_\alpha \phi$, if
$v(\phi) \ge \alpha$, and $\phi$ is strong α-level true, written $\models_{\alpha+} \phi$, if $v(\phi) > \alpha$. From (d1)-(d6),
for $0 \le \alpha \le \beta \le 1$ and $\gamma \in [0, 1]$ we have:

(e1). $\models_0 \phi$,

(e2). If $\models_\beta \phi$, then $\models_\alpha \phi$,

(e3). $\models_\alpha \neg\phi$ iff not $\models_{(1-\alpha)+} \phi$,

(e4). If $\models_\alpha \phi \wedge \psi$, then $\models_\alpha \phi$ and $\models_\alpha \psi$,

(e5). If $\models_\alpha \phi$, then $\models_\alpha \phi \vee \psi$,

(e6). If $\models_\alpha \phi$ and $\models_\gamma \psi$, then $\models_{\max(\alpha,\gamma)} \phi \vee \psi$.

Property (e5) is implied by properties (e2) and (e6).

With the concept of α-level truth, we have the probabilistic modus ponens
rule [15]:

$$\frac{\models_\alpha \phi \to \psi \qquad \models_\beta \phi}{\models_{\max(0,\alpha+\beta-1)} \psi}
\qquad\qquad
\frac{v(\phi \to \psi) \ge \alpha \qquad v(\phi) \ge \beta}{v(\psi) \ge \max(0, \alpha + \beta - 1)}$$

Given the conditions $v(\phi \to \psi) \ge \alpha$ and $v(\phi) \ge \beta$, from properties (d3) and (d6),
we have:

$v(\phi \to \psi) \ge \alpha$
$\Longrightarrow v(\neg\phi \vee \psi) \ge \alpha$
$\Longrightarrow v(\neg\phi) + v(\psi) - v(\neg\phi \wedge \psi) \ge \alpha$
$\Longrightarrow (1 - v(\phi)) + v(\psi) \ge \alpha$
$\Longrightarrow v(\psi) \ge \alpha + v(\phi) - 1$
$\Longrightarrow v(\psi) \ge \alpha + \beta - 1$.

Since the value $v(\psi)$ must be non-negative, we can conclude that the proposed
modus ponens rule is correct. Similar properties and rules can be expressed in
terms of strong α-level truth.
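
The following Python sketch computes truth values as in equation (2) and checks the probabilistic modus ponens bound on one concrete case. The ten-object universe, the two meaning sets, and the helper names `truth` and `modus_ponens_bound` are hypothetical choices made for illustration.

```python
from fractions import Fraction

def truth(m, U):
    """v(phi) = |m(phi)| / |U|, as in equation (2)."""
    return Fraction(len(m), len(U))

def modus_ponens_bound(alpha, beta):
    """Lower bound on v(psi) given v(phi -> psi) >= alpha and v(phi) >= beta."""
    return max(0, alpha + beta - 1)

# Hypothetical universe and meaning sets.
U = set(range(10))
m_phi = {0, 1, 2, 3, 4, 5, 6, 7}
m_psi = {0, 1, 2, 3, 4, 5, 8}
m_imp = (U - m_phi) | m_psi          # meaning of phi -> psi, by (b5)

alpha = truth(m_imp, U)
beta = truth(m_phi, U)
bound = modus_ponens_bound(alpha, beta)
print(alpha, beta, bound, truth(m_psi, U))
```

Exact fractions are used instead of floats so that comparisons such as `truth(m_psi, U) >= bound` are not affected by rounding.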


3  A  Generalized  Decision  Logic 

Let $\mathcal{X}$ be a finite set and $2^{\mathcal{X}}$ be its power set. A subset of $2^{\mathcal{X}}$ of the form:

$\mathcal{A} = [A_1, A_2] = \{X \in 2^{\mathcal{X}} \mid A_1 \subseteq X \subseteq A_2\}$ (3)

is called a closed interval set, where it is assumed that $A_1 \subseteq A_2$. The set of all
closed interval sets is denoted by $I(\mathcal{X})$. Degenerate interval sets of the form
[A, A] are equivalent to ordinary sets. Thus, interval sets may be considered as
an extension of standard sets. In fact, interval-set algebra may be considered as
a set-theoretic counterpart of interval-number algebra [6]. A detailed study of
interval-set algebra can be found in papers by Yao [17, 19].

An interval-set-valued information table generalizes a standard information
table by allowing each object to take interval sets as its values. Formally, this
can be described by information functions:

$I_a : U \to I(V_a)$. (4)

For an object $x \in U$, its value on an attribute $a \in At$ is an interval set
$I_a(x) = [I_{a*}(x), I_a^*(x)]$. The object x definitely has properties in $I_{a*}(x)$, and
possibly has properties in $I_a^*(x)$. With the introduction of interval-set-valued
information tables, a generalized decision logic language, called GDL-language,
can be established. The symbols and formulas of the GDL-language are the same
as those of the DL-language. The semantics of the GDL-language can be defined
similarly in Tarski's style using the notions of a model and two types of
satisfiabilities, one for necessity and the other for possibility. If an object x necessarily
satisfies a formula $\phi$, we write $x \models_* \phi$, and if x possibly satisfies $\phi$, we write $x \models^* \phi$.
The semantics of $\models_*$ and $\models^*$ are defined as follows:


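(f1). $x \models_* (a, v)$ iff $v \in I_{a*}(x)$,
      $x \models^* (a, v)$ iff $v \in I_a^*(x)$,

(f2). $x \models_* \neg\phi$ iff not $x \models^* \phi$,
      $x \models^* \neg\phi$ iff not $x \models_* \phi$,

(f3). $x \models_* \phi \wedge \psi$ iff $x \models_* \phi$ and $x \models_* \psi$,
      $x \models^* \phi \wedge \psi$ iff $x \models^* \phi$ and $x \models^* \psi$,

(f4). $x \models_* \phi \vee \psi$ iff $x \models_* \phi$ or $x \models_* \psi$,
      $x \models^* \phi \vee \psi$ iff $x \models^* \phi$ or $x \models^* \psi$,

(f5). $x \models_* \phi \to \psi$ iff $x \models_* \neg\phi \vee \psi$,
      $x \models^* \phi \to \psi$ iff $x \models^* \neg\phi \vee \psi$,

(f6). $x \models_* \phi \equiv \psi$ iff $x \models_* \phi \to \psi$ and $x \models_* \psi \to \phi$,
      $x \models^* \phi \equiv \psi$ iff $x \models^* \phi \to \psi$ and $x \models^* \psi \to \phi$.

The following property follows immediately from the definition:

(g1). If $x \models_* \phi$, then $x \models^* \phi$.

Although the introduced notions of necessity and possibility are similar in nature
to the notions in modal logic [1], our semantic interpretation is different. There
is a close connection between the above formulation and three-valued logic [19].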

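In GDL, with respect to an interval-set-valued information system S, the
meaning of a formula $\phi$ is the interval set $m(\phi)$ defined by:

$m(\phi) = [\{x \in U \mid x \models_* \phi\}, \{x \in U \mid x \models^* \phi\}] = [m_*(\phi), m^*(\phi)]$. (5)

It can be verified that the following properties hold:

(h1). $m(a, v) = [\{x \in U \mid x \models_* (a, v)\}, \{x \in U \mid x \models^* (a, v)\}]$,

(h2). $m(\neg\phi) = \setminus m(\phi)$,

(h3). $m(\phi \wedge \psi) = m(\phi) \sqcap m(\psi)$,

(h4). $m(\phi \vee \psi) = m(\phi) \sqcup m(\psi)$,

(h5). $m(\phi \to \psi) = \setminus m(\phi) \sqcup m(\psi)$,

(h6). $m(\phi \equiv \psi) = (\setminus m(\phi) \sqcup m(\psi)) \sqcap (m(\phi) \sqcup \setminus m(\psi))$,

where $\setminus$, $\sqcap$, and $\sqcup$ are the interval-set complement, intersection, and union given
by [17]: for two interval sets $\mathcal{A} = [A_1, A_2]$ and $\mathcal{B} = [B_1, B_2]$,

$\setminus\mathcal{A} = \{-X \mid X \in \mathcal{A}\} = [-A_2, -A_1]$,
$\mathcal{A} \sqcap \mathcal{B} = \{X \cap Y \mid X \in \mathcal{A}, Y \in \mathcal{B}\} = [A_1 \cap B_1, A_2 \cap B_2]$,
$\mathcal{A} \sqcup \mathcal{B} = \{X \cup Y \mid X \in \mathcal{A}, Y \in \mathcal{B}\} = [A_1 \cup B_1, A_2 \cup B_2]$. (6)

The meaning of a formula $\phi$ is therefore the interval set of objects, representing
those that definitely have the properties expressed by the formula and those
that possibly have the properties.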
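
A minimal Python sketch of the two satisfiabilities (f1)-(f6) and of the interval-set meaning in equation (5) is given below. The tuple encoding of formulas, the `mode` flag ("nec" for $\models_*$, "pos" for $\models^*$), and the three-object table are assumptions made for this example; each attribute value is stored as a pair (lower set, upper set) standing for $[I_{a*}(x), I_a^*(x)]$.

```python
def sat(x, phi, table, mode):
    """Necessary ('nec') or possible ('pos') satisfaction, following (f1)-(f6)."""
    op = phi[0]
    if op == "atom":                                   # (f1)
        _, a, v = phi
        lower_set, upper_set = table[x][a]
        return v in (lower_set if mode == "nec" else upper_set)
    if op == "not":                                    # (f2): negation swaps the mode
        dual = "pos" if mode == "nec" else "nec"
        return not sat(x, phi[1], table, dual)
    if op == "and":                                    # (f3)
        return sat(x, phi[1], table, mode) and sat(x, phi[2], table, mode)
    if op == "or":                                     # (f4)
        return sat(x, phi[1], table, mode) or sat(x, phi[2], table, mode)
    if op == "imp":                                    # (f5)
        return sat(x, ("or", ("not", phi[1]), phi[2]), table, mode)
    if op == "iff":                                    # (f6)
        return (sat(x, ("imp", phi[1], phi[2]), table, mode)
                and sat(x, ("imp", phi[2], phi[1]), table, mode))
    raise ValueError(f"unknown connective: {op!r}")

def meaning(phi, table):
    """m(phi) = [m_*(phi), m^*(phi)] as a pair of sets, as in equation (5)."""
    return ({x for x in table if sat(x, phi, table, "nec")},
            {x for x in table if sat(x, phi, table, "pos")})

# Hypothetical interval-set-valued table: (definite values, possible values).
table = {
    "o1": {"lang": ({"en"}, {"en", "fr"})},
    "o2": {"lang": ({"fr"}, {"fr"})},
    "o3": {"lang": (set(), {"en", "de"})},
}
en = ("atom", "lang", "en")
fr = ("atom", "lang", "fr")
print(meaning(en, table))
print(meaning(("not", en), table))
print(meaning(("or", en, fr), table))
```

The negation case is the only place where the two satisfiabilities interact, which is what produces the complement $[-A_2, -A_1]$ in (h2).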

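Given the meaning of formulas in terms of interval sets, we define the
interval-valued truth for a formula $\phi$ by extending equation (2):

$v(\phi) = [v_*(\phi), v^*(\phi)] = \left[\frac{|m_*(\phi)|}{|U|}, \frac{|m^*(\phi)|}{|U|}\right]$. (7)

Both lower and upper bounds of $[v_*(\phi), v^*(\phi)]$ have a probabilistic interpretation,
hence we have a probability related interval-valued logic [18]. Properties
corresponding to (d3)-(d6) are given by:

(i1). $v_*(\neg\phi) = 1 - v^*(\phi)$,

(i2). $v_*(\phi \wedge \psi) \le \min(v_*(\phi), v_*(\psi))$,
      $v^*(\phi \wedge \psi) \le \min(v^*(\phi), v^*(\psi))$,

(i3). $v_*(\phi \vee \psi) \ge \max(v_*(\phi), v_*(\psi))$,
      $v^*(\phi \vee \psi) \ge \max(v^*(\phi), v^*(\psi))$,

(i4). $v_*(\phi \vee \psi) = v_*(\phi) + v_*(\psi) - v_*(\phi \wedge \psi)$,
      $v^*(\phi \vee \psi) = v^*(\phi) + v^*(\psi) - v^*(\phi \wedge \psi)$.

The formula $\phi$ is said to be $[v_*(\phi), v^*(\phi)]$-degree true. For a sub-interval $[\alpha_*, \alpha^*]$
of the unit interval [0, 1], a formula $\phi$ is $[\alpha_*, \alpha^*]$-level true, written $\models_{[\alpha_*, \alpha^*]} \phi$, if
$\alpha_* \le v_*(\phi) \le v^*(\phi) \le \alpha^*$, and $\phi$ is strong $[\alpha_*, \alpha^*]$-level true, written $\models_{[\alpha_*, \alpha^*]+} \phi$,
if $\alpha_* < v_*(\phi) \le v^*(\phi) < \alpha^*$. For sub-intervals $[\alpha_*, \alpha^*] \subseteq [\beta_*, \beta^*] \subseteq [0, 1]$ and
$[\gamma_*, \gamma^*] \subseteq [0, 1]$, the following properties hold:

(j1). $\models_{[0,1]} \phi$,

(j2). If $\models_{[\alpha_*, \alpha^*]} \phi$, then $\models_{[\beta_*, \beta^*]} \phi$,

(j3). $\models_{[\alpha_*, \alpha^*]} \neg\phi$ iff not $\models_{[1-\alpha^*, 1-\alpha_*]+} \phi$,

(j4). If $\models_{[\alpha_*, \alpha^*]} \phi \wedge \psi$, then $\models_{[\alpha_*, 1]} \phi$ and $\models_{[\alpha_*, 1]} \psi$,

(j5). If $\models_{[\alpha_*, \alpha^*]} \phi$, then $\models_{[\alpha_*, 1]} \phi \vee \psi$,

(j6). If $\models_{[\alpha_*, \alpha^*]} \phi$ and $\models_{[\gamma_*, \gamma^*]} \psi$, then $\models_{[\max(\alpha_*, \gamma_*), 1]} \phi \vee \psi$,

(j7). If $\models_{[\alpha_*, \alpha^*]} \phi \vee \psi$, then $\models_{[0, \alpha^*]} \phi$ and $\models_{[0, \alpha^*]} \psi$,

(j8). If $\models_{[\alpha_*, \alpha^*]} \phi$, then $\models_{[0, \alpha^*]} \phi \wedge \psi$,

(j9). If $\models_{[\alpha_*, \alpha^*]} \phi$ and $\models_{[\gamma_*, \gamma^*]} \psi$, then $\models_{[0, \min(\alpha^*, \gamma^*)]} \phi \wedge \psi$.

They follow from (i2) and (i3). In fact, properties (j4)-(j6) are the properties
(e4)-(e6) of the DL-language. Properties (j4)-(j6) show the characteristics
of the lower bound, while (j7)-(j9) state the characteristics of the upper bound.
The generalized interval-based modus ponens rule is given by:

$$\frac{\models_{[\alpha_*, \alpha^*]} \phi \to \psi \qquad \models_{[\beta_*, \beta^*]} \phi}{\models_{[\max(0, \alpha_* + \beta_* - 1), \alpha^*]} \psi}
\qquad\qquad
\frac{\alpha_* \le v_*(\phi \to \psi) \le v^*(\phi \to \psi) \le \alpha^* \qquad \beta_* \le v_*(\phi) \le v^*(\phi) \le \beta^*}{\max(0, \alpha_* + \beta_* - 1) \le v_*(\psi) \le v^*(\psi) \le \alpha^*}$$

The part concerning the lower bound is in fact the probabilistic modus ponens
rule introduced in Section 2. The upper bound can be seen as follows. From
$v^*(\phi \to \psi) \le \alpha^*$ and (i3), we can conclude that:

$v^*(\psi) \le v^*(\neg\phi \vee \psi) = v^*(\phi \to \psi) \le \alpha^*$.

Thus, the interval-based modus ponens rule is correct. Finally, it should be
pointed out that the logic of Section 2 is a special case of interval-valued logic.
More specifically, α-level truth can be translated into the [α, 1]-level truth.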
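
The interval-valued truth of equation (7) and the interval-based modus ponens bound can be checked on a small case with the Python sketch below. Meanings are represented directly as pairs (lower set, upper set); the universe, the two meaning intervals, and the helper names (`interval_truth`, `neg`, `join`, `mp_interval`) are hypothetical and serve only as an illustration.

```python
from fractions import Fraction

def interval_truth(m, U):
    """v(phi) = [|m_*|/|U|, |m^*|/|U|], as in equation (7)."""
    lo, hi = m
    return (Fraction(len(lo), len(U)), Fraction(len(hi), len(U)))

def neg(m, U):
    """Interval-set complement: [A1, A2] -> [-A2, -A1]."""
    lo, hi = m
    return (U - hi, U - lo)

def join(m1, m2):
    """Interval-set union: [A1 | B1, A2 | B2]."""
    return (m1[0] | m2[0], m1[1] | m2[1])

def mp_interval(v_imp, v_phi):
    """Interval for v(psi) given the intervals for phi -> psi and phi."""
    return (max(0, v_imp[0] + v_phi[0] - 1), v_imp[1])

# Hypothetical universe and meaning intervals (lower set inside upper set).
U = set(range(10))
m_phi = ({0, 1, 2, 3, 4, 5}, {0, 1, 2, 3, 4, 5, 6, 7})
m_psi = ({0, 1, 2, 3}, {0, 1, 2, 3, 4, 8})
m_imp = join(neg(m_phi, U), m_psi)        # meaning of phi -> psi, by (h5)

v_phi = interval_truth(m_phi, U)
v_psi = interval_truth(m_psi, U)
v_imp = interval_truth(m_imp, U)
bound = mp_interval(v_imp, v_phi)
print(v_imp, v_phi, bound, v_psi)
```

Here the premises are taken at their tightest levels, that is, $[\alpha_*, \alpha^*] = v(\phi \to \psi)$ and $[\beta_*, \beta^*] = v(\phi)$; looser levels only widen the resulting interval.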


4  Comments  on  Related  Studies 


An interval-valued logic can also be introduced in the standard information
tables through the use of lower and upper approximations of the rough set
theory [5, 9]. For each subset of the attributes, one can define an equivalence relation
on the set of objects in an information table. An arbitrary set is approximated
by equivalence classes as follows: the lower approximation is the union of those
equivalence classes that are included in the set, while the upper approximation
is the union of those equivalence classes that have a nonempty intersection with
the set. Thus, for a formula $\phi$ with interpretation $m(\phi)$, we have a pair of lower
and upper approximations $\underline{apr}(m(\phi))$ and $\overline{apr}(m(\phi))$. An interval-valued truth
can be defined as:

$v(\phi) = \left[\frac{|\underline{apr}(m(\phi))|}{|U|}, \frac{|\overline{apr}(m(\phi))|}{|U|}\right]$. (8)

Based on this interpretation of interval-valued truth, Parsons et al. [9] introduced
a logic system RL for rough reasoning. Their inference rules are related to, but
different from, the inference rules introduced in this paper. A problem with RL
is that the interpretation of the rough measure is not entirely clear. The measure
is not fully consistent with the definition of truth value given by equation (8). It
may be interesting to have an in-depth investigation of the interval-valued logic
based on equation (8). An important feature of such a logic is its non-truth-functional
logic connectives. This makes it different from the interval set algebra
related systems GDL and RL.
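
The approximation-based truth value of equation (8), and the fact that its connectives are not truth-functional, can be illustrated with the following Python sketch. The six-object universe, the partition, and the function name `approx_truth` are assumptions made for this example.

```python
from fractions import Fraction

def approx_truth(m, partition, U):
    """Interval truth from lower and upper approximations, as in equation (8)."""
    lower = set().union(*(b for b in partition if b <= m))
    upper = set().union(*(b for b in partition if b & m))
    return (Fraction(len(lower), len(U)), Fraction(len(upper), len(U)))

# Hypothetical universe with three equivalence classes.
U = {1, 2, 3, 4, 5, 6}
partition = [{1, 2}, {3, 4}, {5, 6}]

print(approx_truth({1, 2, 3}, partition, U))

# Two meaning sets with the same interval truth value ...
m_a, m_b = {3}, {4}
same_inputs = approx_truth(m_a, partition, U) == approx_truth(m_b, partition, U)
# ... whose conjunctions with m_a nevertheless receive different values.
v_aa = approx_truth(m_a & m_a, partition, U)
v_ab = approx_truth(m_a & m_b, partition, U)
print(same_inputs, v_aa, v_ab)
```

Since `m_a` and `m_b` have the same truth interval but `m_a & m_a` and `m_a & m_b` do not, the value of a conjunction cannot be computed from the values of its conjuncts alone under equation (8).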

In a recent paper, Pawlak [12] introduced the notion of rough modus ponens
in information tables. The logical formula $\phi \to \psi$ is interpreted as a decision
rule. A certainty factor is associated with $\phi \to \psi$ as follows:

$\mu(\phi, \psi) = \frac{|m(\phi) \cap m(\psi)|}{|m(\phi)|}$. (9)

It can in fact be interpreted as a conditional probability. The rough modus
ponens rule is given by:

$$\frac{\phi \to \psi : \mu(\phi, \psi) \qquad \phi : \pi(\phi)}{\psi : \pi(\neg\phi \wedge \psi) + \pi(\phi)\mu(\phi, \psi)}$$

This rule is closely related to Bayes' theorem [13]. One may easily generalize
the rough modus ponens if α-level truth values are used. The main difference
between the two modus ponens rules stems from the distinct interpretations of the
logical formula $\phi \to \psi$.

5  Conclusion 

Two generalizations of Pawlak's information table based decision logic DL are
introduced and examined. One generalization is based on the notion of degree of
truth, which extends DL from two-valued logic to many-valued logic. The other
generalization relies on interval-set-based information tables. In this case, two
types of satisfiability are used, in a spirit similar to that of modal logic. They lead to
an interval-set interpretation of formulas. Consequently, interval-degree truth and
interval-level truth are introduced as a generalization of single-valued degree of
truth. The truth values of formulas are associated with probabilistic interpretations.
The derived logic systems are essentially related to probabilistic reasoning.
In particular, probabilistic modus ponens rules are studied.

In this paper, we have presented only the basic formulation and interpretation of
the generalized decision logic. As pointed out by an anonymous referee of the
paper, a formal proof system is needed and applications need to be explored.
It may also be interesting to analyze other, non-probabilistic interpretations of
truth values.




References 

1. Chellas, B.F., Modal Logic: An Introduction, Cambridge University Press,
Cambridge, 1980.

2. Klir, G.J. and Yuan, B., Fuzzy Sets and Fuzzy Logic, Theory and Applications,
Prentice Hall, New Jersey, 1995.

3. Lipski, W. Jr., On databases with incomplete information, Journal of the ACM,
28, 41-70, 1981.

4. Liu, Q., The 01-resolutions of operator rough logic, LNAI 1424,
Springer-Verlag, Berlin, 432-436, 1998.

5. Liu, Q. and Wang, Q., Rough number based on rough sets and logic values of A
operator (in Chinese), Journal of Software, 7, 455-461, 1996.

6. Moore, R.E., Interval Analysis, Prentice-Hall, New Jersey, 1966.

7. Nilsson, N.J., Probabilistic logic, Artificial Intelligence, 28, 71-87, 1986.

8. Orlowska, E., Reasoning about vague concepts, Bulletin of Polish Academy of
Science, Mathematics, 35, 643-652, 1987.

9. Parsons, S., Kubat, M. and Dohnal, M., A rough set approach to reasoning under
uncertainty, Journal of Experimental and Theoretical Artificial Intelligence, 7,
175-193, 1995.

10. Pawlak, Z., Information systems - theoretical foundations, Information Systems, 6,
205-218, 1981.

11. Pawlak, Z., Rough Sets, Theoretical Aspects of Reasoning about Data, Kluwer
Academic Publishers, Boston, 1991.

12. Pawlak, Z., Rough modus ponens, Proceedings of the 7th International Conference
on Information Processing and Management of Uncertainty in Knowledge-Based
Systems, 1162-1166, 1998.

13. Pawlak, Z., Data mining - a rough set perspective, LNAI 1574, Springer-Verlag,
Berlin, 3-12, 1999.

14. Pomykala, J.A., On definability in the nondeterministic information system,
Bulletin of the Polish Academy of Sciences, Mathematics, 36, 193-210, 1988.

15. Prade, H., A computational approach to approximate and plausible reasoning with
applications to expert systems, IEEE Transactions on Pattern Analysis and
Machine Intelligence, PAMI-7, 260-283, 1985.

16. Vakarelov, D., A modal logic for similarity relations in Pawlak knowledge
representation systems, Fundamenta Informaticae, XV, 61-79, 1991.

17. Yao, Y.Y., Interval-set algebra for qualitative knowledge representation, Proceedings
of the Fifth International Conference on Computing and Information, 370-374,
1993.

18. Yao, Y.Y., A comparison of two interval-valued probabilistic reasoning methods,
Proceedings of the 6th International Conference on Computing and Information,
special issue of Journal of Computing and Information, 1, 1090-1105 (paper number
D6), 1995.

19. Yao, Y.Y. and Li, X., Comparison of rough-set and interval-set models for uncertain
reasoning, Fundamenta Informaticae, 27, 1996.

20. Yao, Y.Y. and Noroozi, N., A unified framework for set-based computations, Soft
Computing: the Proceedings of the 3rd International Workshop on Rough Sets and
Soft Computing, 252-255, 1995.

21. Yao, Y.Y. and Zhong, N., Granular Computing Using Information Tables,
manuscript, 1999.


Many-Valued Dynamic Logic for Qualitative
Decision Theory


Churn-Jung Liau

Institute of Information Science
Academia Sinica, Taipei, Taiwan
liaucj@iis.sinica.edu.tw


Abstract. This paper presents an integration of the dynamic logic
semantics and rational decision theory. Logics for reasoning about the
expected utilities of actions are proposed. The well-formed formulas of the
logics are viewed as the possible goals to be achieved by the decision
maker and the truth values of the formulas are considered as the
utilities of the goals. Thus the logics are many-valued dynamic logics. Based
on different interpretations of acts in the logics, we can model
different decision theory paradigms, such as possibilistic decision theory and
case-based decision theory.


1  Introduction 

Rational decision theory is a very important research topic in many academic
fields such as economics, politics, and philosophy. Recently, it has also received
more and more attention from the AI community due to the development of
intelligent agent systems. The basic execution loop of an intelligent agent consists
of three phases: perception, deliberation, and action. In the perception phase,
the agent senses the status of the environment and receives information from
other agents. Then, in the deliberation phase, the agent reasons with the
observed and received information and plans its actions for achieving its goals.
Finally, in the action phase, the plan is executed. The capabilities of both
reasoning about actions and decision making are crucial to the success of the
deliberation phase, since the agent has to know the possible effects of actions and select
the appropriate actions for achieving its goals.

A variety of formalisms for reasoning about actions have been developed in
AI, theoretical computer science, and philosophical logic. Among them, dynamic
logic was originally proposed for reasoning about program behavior [10], and
subsequently adopted for reasoning about actions by the AI community. Though
the advantages of using dynamic logic for reasoning about actions have been
emphasized in [6], traditional dynamic logic has only limited capability in
handling uncertainty.

In dynamic logic, a formula $[\alpha]\varphi$ denotes that $\varphi$ holds after the execution of the
(possibly compound) action $\alpha$, so in principle, if the agent's goal is $\varphi$ and $[\alpha]\varphi$
can be derived from the description of the initial situation, then $\alpha$ is a feasible
plan for achieving the goal. This gives rise to a decision theoretic reading of
dynamic logic semantics. Since nondeterministic actions are allowed in dynamic
logic, it may capture the agent's ignorance of the possible effects of actions to
some extent. However, uncertainty pervades the whole deliberation phase, so
further extensions of dynamic logic for handling different forms of uncertainty
are needed. In general, there are three forms of uncertainty in the deliberation
process.

- The perception of the agent may be imperfect and the information it receives
may be incomplete and faulty, so its knowledge of the status of the
environment is uncertain. Sometimes probabilistic rather than exact
knowledge is available. At other times, however, even probabilistic knowledge
is not available, so a more general treatment is also needed.

- In multi-agent systems, our agent may not be the only one which can cause
the world to change, so it has only partial knowledge about the possible
effects of the actions. Classical dynamic logic may handle the case when
the knowledge is imprecise (i.e., the effect of an action is represented as a set
of states). However, the knowledge may also be probabilistic or possibilistic
(i.e., the effect of an action is represented as a probability or possibility
distribution on the set of states). We recall that a possibility distribution on
a set $X$ is a mapping $\pi : X \to [0,1]$ such that $\pi(x)$ measures the extent to
which $x$ is likely to be the actual consequence [14]. Dynamic logic should
be extended to cover such cases.

- Since dynamic logic is two-valued, the goal for an agent to achieve must be
crisp and non-flexible. A goal is either satisfied or not satisfied. However,
we may sometimes want to describe more flexible goals. A goal may be
satisfied to some degree. In decision theory, this is in general described by a
real-valued utility function. Recently, more general notions of ordinal
preference have been considered [1, 13]. To represent flexible goals in dynamic logic,
we will generalize its semantics to a many-valued one.

On the other hand, in decision theoretic contexts, far richer notions of
uncertainty have been explored. Besides classical decision theory, in which the
notions of probability and expected utility are of central importance, some
alternatives, such as possibilistic decision theory [1], case-based decision theory [7],
and belief function-based decision theory [11], have recently been proposed and
axiomatically justified in different settings. The main concern of decision theory
is to choose an action which will maximize the expected utility of performing
the action given some knowledge on the effects of the action and the desirability
of these effects. In the extreme case that the utility function is two-valued and
the available knowledge is imprecise, this is just a rephrasing of the decision
theoretic interpretation of dynamic logic. The only difference is that in general
the set of available acts in decision theory is not algebraically structured as in
dynamic logic. Thus, due to the usefulness of dynamic logic in reasoning about
actions and the rich notions of uncertainty in decision theories, the
combination of decision theory and dynamic logic semantics will have the advantages of
cross-fertilization.

In this paper, we suggest a kind of integration between decision theoretic
notions and dynamic logic semantics. The dynamic logic semantics is enhanced to
a many-valued one, so the truth value of a formula in a state plays a two-fold role.
One is the degree of satisfaction of the formula and the other is the utility of the
state. Since each formula corresponds to a goal and we can consider more than
one formula at the same time, we can easily describe multiple
objective decision making in the logic. In classical dynamic logic, each action is
interpreted as a binary transition relation on the set of states. Here, depending
on the uncertainty handling formalism, we can generalize it to a fuzzy
relation, a set of probability distributions, or a set of possibility distributions
generated from a similarity relation. Thus, we will try to develop several
many-valued dynamic logics for reasoning with different uncertainty formalisms.

In what follows, we will first review the basic notions from classical
decision theory and some recent proposals of qualitative alternatives. Then the
dynamic logics for possibilistic decision theory and case-based decision theory are
considered in turn. Finally, conclusions are given and some possible research
directions are suggested.


2  Review  of  Some  Decision  Theories 

Classical quantitative decision theory considers expected utility maximization
(EUM) as the criterion of rational choice. In the theory, a decision framework
is a 4-tuple $(D, X, \mu, u)$, where $D$ is a set of available decision acts, $X$ is a set
of possible outcomes, $\mu : D \times X \to [0,1]$ assigns to each decision act $d \in D$
a probability distribution $\mu(d, \cdot)$ on $X$, and $u : X \to \mathbb{R}$ is the utility function.
Then the expected utility of a decision $d$ is defined as

$$U(d) = \sum_{x \in X} \mu(d, x) \cdot u(x),$$

and the decision maker will choose $d_0$ such that $U(d_0) = \max_{d \in D} U(d)$.
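
As a minimal sketch of the EUM criterion, the following computes $U(d)$ for a toy framework and picks a maximizing act. The act names, outcome names, and numbers are hypothetical and serve only as illustration; they do not come from the paper.

```python
# Toy decision framework (D, X, mu, u). All names and numbers are hypothetical.
mu = {
    "carry_umbrella": {"dry": 0.75, "wet": 0.25},  # mu(d, .) is a probability distribution on X
    "no_umbrella": {"dry": 0.5, "wet": 0.5},
}
u = {"dry": 1.0, "wet": 0.0}  # real-valued utility of each outcome


def expected_utility(d):
    """U(d) = sum over x in X of mu(d, x) * u(x)."""
    return sum(prob * u[x] for x, prob in mu[d].items())


def best_act():
    """Return an act d0 with U(d0) = max over d in D of U(d)."""
    return max(mu, key=expected_utility)
```

Here `best_act()` selects the act whose outcome distribution puts the most weight on high-utility outcomes, which is exactly what the arithmetic (sum and product) in the definition encodes.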

While the computation of $U(d)$ relies on arithmetic operations (mainly $+$
and $\cdot$) on real numbers, qualitative decision theory concentrates more on the
decision maker's ordinal preference and uncertainty about the possible outcomes.
Recently, a qualitative decision theory (PODT) based on possibilistic logic has been
proposed [1]. In the theory, a possibilistic decision framework is a 4-tuple $(D, X, \pi, u)$,
where $D$ and $X$ are defined as above, $\pi : D \times X \to T_1$ assigns to each decision act
$d \in D$ a possibility distribution $\pi(d, \cdot) : X \to T_1$, and $u : X \to T_2$ is the utility
assignment function. Here, $T_1$ and $T_2$ are linearly ordered scales, and under the
commensurability assumption, we can assume $T_1 = T_2 = T$ without loss of
generality. Typical examples of $T$ are $[0,1]$ or a subset of $[0,1]$. Let $n : T \to T$ be an
order-reversing map on $T$; then two qualitative expected utilities for a decision
$d$ can be defined. For a risk-averse decision maker, the pessimistic expected
utility is

$$U_*(d) = \min_{x \in X} \max(n(\pi(d, x)), u(x)),$$

and for a risk-prone decision maker, the optimistic expected utility is

$$U^*(d) = \max_{x \in X} \min(\pi(d, x), u(x)).$$

The PODT is particularly suitable for complicated situations in which complete
probabilistic information is rarely available.
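
The two qualitative utilities can be sketched in a few lines. The sketch assumes $T = [0,1]$ and the order-reversing map $n(t) = 1 - t$; both choices, and the toy distribution below, are assumptions made for illustration.

```python
def n(t):
    """Order-reversing map on T = [0, 1] (assumed here to be 1 - t)."""
    return 1.0 - t


def pessimistic_utility(pi_d, u):
    """U_*(d) = min over x of max(n(pi(d, x)), u(x))."""
    return min(max(n(pi_d[x]), u[x]) for x in pi_d)


def optimistic_utility(pi_d, u):
    """U^*(d) = max over x of min(pi(d, x), u(x))."""
    return max(min(pi_d[x], u[x]) for x in pi_d)


# Hypothetical possibility distribution pi(d, .) and utility assignment u.
pi_d = {"x1": 1.0, "x2": 0.5, "x3": 0.0}
u = {"x1": 0.75, "x2": 0.25, "x3": 0.0}
```

Only `min`, `max`, and the order reversal are used, so the result depends on the ordering of the scale rather than on arithmetic; the impossible outcome `x3` does not lower the pessimistic value, while the fairly possible but poor outcome `x2` does.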

Another recently proposed alternative decision theory is the case-based one.
According to [7], the purpose of case-based decision theory (CBDT) is to
model decision making under uncertainty by formalizing reasoning by analogies. It
suggests that decision makers tend to choose actions which performed well in
similar past cases. Each case is viewed as a triple of a situation (i.e., a decision
problem), the action chosen in it, and the consequence of performing the action.
Thus, a case-based decision framework is a 6-tuple $(P, D, X, C, s, u)$, where $D$,
$X$, and $u$ are defined as above, $P$ is a set of situations, $C \subseteq P \times D \times X$ is a finite
set of cases, called the memory of the decision maker, and $s : P \times P \to [0,1]$ is a
similarity function which measures the similarity between situations. Then, for
a given situation $p$, the expected utility for a decision $d$ is defined as

$$U_c(d) = \sum_{(q,d,x) \in C} s(p, q) \cdot u(x).$$

However, it is also pointed out that $U_c$ is cumulative in nature, so the number of
times a certain act was chosen in the past will affect its perceived desirability. Thus,
an act that was chosen repeatedly with bad results may be considered
superior to an act that was chosen only once but with a good result. To
overcome the difficulty, the average utility is considered in [7]; namely, in the
above equation, the similarity function $s$ is replaced by $s'$, which is defined as

$$s'(p, q) = \begin{cases} s(p, q) \big/ \sum_{(q',d,x) \in C} s(p, q') & \text{if well-defined} \\ 0 & \text{otherwise.} \end{cases}$$

In [2], a more qualitative version of expected utility is considered, which can also
eliminate the cumulation. Assume the utility values are normalized to the range
$[0,1]$. The definition is analogous to the pessimistic and optimistic expected
utility in the PODT:

$$U_{c*}(d) = \min_{(q,d,x) \in C} \max(n(s(p, q)), u(x)),$$

$$U_c^*(d) = \max_{(q,d,x) \in C} \min(s(p, q), u(x)).$$

While in PODT it is observed that the criterion of maximizing $U^*(d)$ is
sometimes over-optimistic [5], it seems that for CBDT the pessimistic utility has some
counterintuitive results. For example, if an act $a$ was only adopted in the past
for cases that are completely different from the present situation, i.e., $s(p, q) = 0$ for all
$(q, a, x) \in C$, then we will have $U_{c*}(a) = 1$, which is the maximum
value. This phenomenon is due to the fact that the case memory is only a partial
description of the world, so we may encounter a novel situation in which no past
experience can be followed. In CBDT, we tend to find the most similar past case
and apply its solution to the new situation, so the optimistic criterion seems
more suitable.
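
The case-based utilities and the two anomalies discussed above can be reproduced with a small sketch. The memory, similarity values, and utilities below are hypothetical, the order-reversing map is assumed to be $n(t) = 1 - t$, and the value returned when no case matches (1 for the minimum, 0 for the maximum) is an assumed convention.

```python
def cases_for(memory, d):
    """Past cases (q, x) in which act d was chosen."""
    return [(q, x) for (q, act, x) in memory if act == d]


def cumulative_utility(p, d, memory, s, u):
    """U_c(d) = sum over (q, d, x) in C of s(p, q) * u(x)."""
    return sum(s[(p, q)] * u[x] for q, x in cases_for(memory, d))


def average_utility(p, d, memory, s, u):
    """Same sum with s replaced by the normalized similarity s'."""
    cases = cases_for(memory, d)
    total = sum(s[(p, q)] for q, _ in cases)
    if total == 0:
        return 0.0  # s' is taken to be 0 when the ratio is not well-defined
    return sum(s[(p, q)] / total * u[x] for q, x in cases)


def pessimistic_utility(p, d, memory, s, u):
    """U_{c*}(d) = min over cases of max(1 - s(p, q), u(x))."""
    return min((max(1.0 - s[(p, q)], u[x]) for q, x in cases_for(memory, d)), default=1.0)


def optimistic_utility(p, d, memory, s, u):
    """U_c^*(d) = max over cases of min(s(p, q), u(x))."""
    return max((min(s[(p, q)], u[x]) for q, x in cases_for(memory, d)), default=0.0)


# Hypothetical memory: act "b" was chosen twice with poor results, "c" once with a
# good result, and "a" only in a situation entirely unlike the present one.
memory = [("q1", "b", "poor"), ("q2", "b", "poor"), ("q3", "c", "good"), ("q4", "a", "bad")]
s = {("p", "q1"): 1.0, ("p", "q2"): 1.0, ("p", "q3"): 1.0, ("p", "q4"): 0.0}
u = {"poor": 0.375, "good": 0.5, "bad": 0.0}
```

With these numbers the cumulative utility ranks `b` above `c` although `c` performed better, the average utility reverses that ranking, and the pessimistic utility of `a` reaches the maximum value 1 even though nothing relevant is known about it.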

3  Possibilistic  Dynamic  Logic 

The possibilistic dynamic logic (PoDL) provides an integrated treatment of
possibilistic decision theory and dynamic logic. The syntax and semantics of
PoDL are an extension of those for dynamic logic [10] and the fuzzy modal logic ShMV
in [8], which in turn is based on the rational Pavelka logic proposed in [9].


3.1  Syntax 

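The alphabet of PoDL consists of

1. a set of propositional letters, $PV = \{p, q, \ldots\}$,

2. a set of atomic actions, $A = \{a, b, c, \ldots\}$,

3. the set of truth constants $\bar r$ for each rational $r \in [0,1]$, and

4. the logical symbols $\sim$, $\to$, $\wedge$, $\vee$, $[$, $]$, $;$, $\cup$, $*$, and $?$.

The set of well-formed formulas ($\Phi$) and the set of action expressions ($\Pi$) are
defined inductively in the following way.

1. $\Phi$ is the smallest set such that

- $PV \subseteq \Phi$ and $\bar r \in \Phi$ for all rational $r \in [0,1]$, and

- if $\varphi, \psi \in \Phi$ and $\alpha \in \Pi$, then $\sim\varphi$, $\varphi \to \psi$, $\varphi \wedge \psi$, $\varphi \vee \psi$, $[\alpha]\varphi \in \Phi$.

2. $\Pi$ is the smallest set such that

- $A \subseteq \Pi$, and

- if $\alpha, \beta \in \Pi$ and $\varphi \in \Phi$, then $\alpha;\beta$, $\alpha \cup \beta$, $\alpha^*$, $\varphi? \in \Pi$.

Some abbreviations of PoDL include $\neg\varphi = \varphi \to \bar 0$, $\varphi \otimes \psi = \neg(\varphi \to \neg\psi)$,
$\varphi \oplus \psi = \neg\varphi \to \psi$, and $\langle\alpha\rangle\varphi = \neg[\alpha]\neg\varphi$.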

3.2  Semantics 


The semantics of PoDL is defined relative to a given Kripke structure $M =
(W, \tau, |\cdot|, w_0)$, where $W$ is a set of possible worlds, $\tau : W \times PV \to [0,1]$ is the
truth valuation function, $|\cdot| : A \to (W \times W \to [0,1])$ is the action denotation
function, and $w_0 \in W$ is a designated world. The mappings $\tau$ and $|\cdot|$ are extended
to $\Phi$ and $\Pi$ as follows.

1. $\tau(w, \bar r) = r$,

2. $\tau(w, \sim\varphi) = 0$ if $\tau(w, \varphi) = 1$, and $\tau(w, \sim\varphi) = 1$ otherwise,

3. $\tau(w, \varphi \to \psi) = I(\tau(w, \varphi), \tau(w, \psi))$,

4. $\tau(w, \varphi \wedge \psi) = \min(\tau(w, \varphi), \tau(w, \psi))$,

5. $\tau(w, \varphi \vee \psi) = \max(\tau(w, \varphi), \tau(w, \psi))$,

6. $\tau(w, [\alpha]\varphi) = \inf_{x \in W} \max(1 - |\alpha|(w, x), \tau(x, \varphi))$,

7. $|\alpha;\beta| = |\alpha| \circ |\beta|$, i.e., the composition of the two fuzzy relations $|\alpha|$ and $|\beta|$,

8. $|\alpha \cup \beta| = |\alpha| \cup |\beta|$,

9. $|\alpha^*| = |\alpha|^*$, i.e., the reflexive and transitive closure of $|\alpha|$,

10. $|\varphi?|(w, x) = \tau(w, \varphi)$ if $w = x$, and $|\varphi?|(w, x) = 0$ otherwise,

where $I : [0,1] \times [0,1] \to [0,1]$ is an implication function. Typical implication
functions include material implication $I(x, y) = \max(1 - x, y)$, Lukasiewicz's
implication $I(x, y) = \min(1, 1 - x + y)$, and Godel implication $I(x, y) = 1$ if
$x \le y$ and $I(x, y) = y$ if $x > y$. Here we will let $I$ denote Lukasiewicz's implication.

Let $w \models_M \varphi$ denote $\tau(w, \varphi) = 1$; then $\varphi$ is true in $M$, written as $M \models \varphi$, if
$w_0 \models_M \varphi$, and for a set $\Sigma$ of wffs, we write $M \models \Sigma$ if $M \models \varphi$ for all $\varphi \in \Sigma$.
Furthermore, $\varphi$ is said to be an (external) logical consequence of $\Sigma$, denoted by
$\Sigma \models \varphi$, if for any model $M$, $M \models \Sigma$ implies $M \models \varphi$. When $\Sigma$ is the empty set,
this is abbreviated as $\models \varphi$ and $\varphi$ is said to be valid. Moreover, $\varphi$ is satisfiable if
$\sim\varphi$ is not valid and weakly satisfiable if $\neg\varphi$ is not valid.
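
For a finite set of worlds, the semantic clauses can be turned into a small evaluator. The sketch below is only an illustration under stated assumptions: worlds are indexed 0..n-1, fuzzy relations are n-by-n lists, formulas and actions are nested tuples with hypothetical tags (`"var"`, `"box"`, `"seq"`, and so on), the infimum becomes a minimum because the set of worlds is finite, and the toy model is invented.

```python
def compose(R, Q):
    """Sup-min composition of two fuzzy relations on a finite set of worlds."""
    n = len(R)
    return [[max(min(R[i][k], Q[k][j]) for k in range(n)) for j in range(n)]
            for i in range(n)]


def union(R, Q):
    """Pointwise maximum of two fuzzy relations."""
    n = len(R)
    return [[max(R[i][j], Q[i][j]) for j in range(n)] for i in range(n)]


def star(R):
    """Reflexive and transitive (max-min) closure, by repeated squaring."""
    n = len(R)
    closure = [[1.0 if i == j else R[i][j] for j in range(n)] for i in range(n)]
    while True:
        nxt = compose(closure, closure)
        if nxt == closure:
            return closure
        closure = nxt


def implication(x, y):
    """Lukasiewicz implication I(x, y) = min(1, 1 - x + y)."""
    return min(1.0, 1.0 - x + y)


def rel(M, a):
    """Denotation |a| of an action expression as a fuzzy relation."""
    kind = a[0]
    if kind == "atom":
        return M["act"][a[1]]
    if kind == "seq":
        return compose(rel(M, a[1]), rel(M, a[2]))
    if kind == "cup":
        return union(rel(M, a[1]), rel(M, a[2]))
    if kind == "star":
        return star(rel(M, a[1]))
    if kind == "test":
        n = M["n"]
        return [[tau(M, i, a[1]) if i == j else 0.0 for j in range(n)] for i in range(n)]
    raise ValueError(kind)


def tau(M, w, f):
    """Truth value of formula f at world w."""
    kind = f[0]
    if kind == "var":
        return M["val"][w][f[1]]
    if kind == "const":
        return f[1]
    if kind == "sim":  # the connective ~
        return 0.0 if tau(M, w, f[1]) == 1.0 else 1.0
    if kind == "imp":
        return implication(tau(M, w, f[1]), tau(M, w, f[2]))
    if kind == "and":
        return min(tau(M, w, f[1]), tau(M, w, f[2]))
    if kind == "or":
        return max(tau(M, w, f[1]), tau(M, w, f[2]))
    if kind == "box":
        R = rel(M, f[1])
        return min(max(1.0 - R[w][x], tau(M, x, f[2])) for x in range(M["n"]))
    raise ValueError(kind)


def neg(f):
    """Abbreviation: not f is f -> 0."""
    return ("imp", f, ("const", 0.0))


def dia(a, f):
    """Abbreviation: <a>f is not [a] not f."""
    return neg(("box", a, neg(f)))


# Hypothetical two-world model with one propositional letter and one atomic action.
M = {"n": 2, "val": [{"p": 0.25}, {"p": 0.75}], "act": {"a": [[0.0, 1.0], [0.5, 1.0]]}}
P, A = ("var", "p"), ("atom", "a")
```

For instance, `tau(M, 1, ("box", A, P))` and `tau(M, 1, dia(A, P))` give the degrees to which `p` is guaranteed and, respectively, possibly achieved by performing `a` in world 1.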


3.3  Discussion 

In the semantics above, if we consider $\Pi$ as the set of available acts, and $\Phi$ the
set of goals, then a Kripke structure is a generalization of a possibilistic decision
framework. It can be seen that $\tau(w, [\alpha]\varphi)$ and $\tau(w, \langle\alpha\rangle\varphi)$ are respectively the
pessimistic and optimistic utility of doing $\alpha$ under state $w$ with respect to the
decision objective $\varphi$.
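
The optimistic reading of the diamond operator can be checked by a short calculation. It assumes that $\langle\alpha\rangle\varphi$ abbreviates $\neg[\alpha]\neg\varphi$, that $\neg\varphi$ abbreviates $\varphi \to \bar 0$, and that $I$ is Lukasiewicz's implication:

```latex
\begin{align*}
\tau(w, \neg\varphi) &= I(\tau(w, \varphi), 0) = \min(1, 1 - \tau(w, \varphi)) = 1 - \tau(w, \varphi), \\
\tau(w, \langle\alpha\rangle\varphi) &= 1 - \tau(w, [\alpha]\neg\varphi) \\
  &= 1 - \inf_{x \in W} \max\bigl(1 - |\alpha|(w, x),\; 1 - \tau(x, \varphi)\bigr) \\
  &= \sup_{x \in W} \min\bigl(|\alpha|(w, x),\; \tau(x, \varphi)\bigr).
\end{align*}
```

Reading $|\alpha|(w, \cdot)$ as a possibility distribution, $\tau(\cdot, \varphi)$ as a utility assignment, and taking $n(t) = 1 - t$, the clause for $[\alpha]\varphi$ has the shape of $U_*$ and the last line has the shape of $U^*$.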

To see how the PoDL model generalizes a possibilistic decision framework, let
us consider the formal correspondence between them. Let $M = (W, \tau, |\cdot|, w_0)$ be
a PoDL model and $\varphi$ be a fixed wff; then $\mathcal{D}(M, \varphi) = (D, X, \pi, u)$ is a possibilistic
decision framework, where

- $D = \Pi$,

- $X = W$,

- $\pi : D \times X \to [0,1]$, $\pi(\alpha, w) = |\alpha|(w_0, w)$ for all $\alpha \in D$ and $w \in X$,

- $u : X \to [0,1]$, $u(w) = \tau(w, \varphi)$ for all $w \in X$.

The difference is made explicit by the formal correspondence. First, in a
possibilistic decision framework, the set of available decision acts is taken as primitive,
so it is only necessary to specify the possible effects of each act from the initial
situation $w_0$, which is implicitly assumed. On the other hand, in PoDL, the set
of actions is composed from some atomic ones, so to know the effects of an action
under the initial situation, we also have to know the effects of its constituent
actions under different situations. In other words, the decision maker chooses plans
instead of single actions in a PoDL model. Second, in a possibilistic decision
framework, a utility function is given, which is implicitly assumed to correspond
to a goal of the decision maker, whereas in PoDL, we have a bundle of utility
functions, each corresponding to a wff of the language. Thus, we can know the
utilities of not only some primitive goals $p$ and $q$, but also the conjunctive goal
$p \wedge q$, the aggregated goal $p \oplus q$, the negated goal $\neg p$, and so on. This is
particularly suitable for multiple objective decision making. Moreover, because of the
use of rational Pavelka logic, the prioritized or thresholded goals mentioned in [3]
can also be expressed in PoDL. For example, we can write goals like $\varphi \vee \overline{1 - r}$
or $\bar r \to \varphi$.
When we can completely specify a possibilistic decision framework or a PoDL
model $M$, the decision making process amounts to the model checking
problem in $M$. For example, if $M \models \bar r \to [\alpha]\varphi$, then we know that for goal $\varphi$, the plan
$\alpha$ has pessimistic expected utility at least $r$. If $M \models [\beta]\varphi \to [\alpha]\varphi$, then $\alpha$ is
a better plan than $\beta$ for satisfying $\varphi$ according to the criterion of maximizing
$U_*$. Sometimes, it is not easy to have a complete specification of the decision
framework. Instead, we may have only some partial description of the status of
the environment and the preconditions and effects of the primitive actions. Some
typical sentences for the description are non-modal wffs of PoDL, or formulas of
the form $\varphi \to (\bar r \to [a]\psi)$ where $a$ is atomic and $\varphi$ and $\psi$ are non-modal. In this
case, assume $\Sigma$ is the set of descriptions; then the decision making process
amounts to a deduction problem in the logic. We must try to derive formulas
like $\bar r \to [\alpha]\varphi$ or $[\beta]\varphi \to [\alpha]\varphi$ from $\Sigma$ by proof methods of the logic. Though
the development of proof methods for the logic is beyond the purely semantic
concern of the paper, it is indeed a very interesting direction for further research.


4  Case-based  Dynamic  Logic 

Analogous to PoDL, in this section we develop a case-based dynamic logic (CbDL)
for reasoning about actions and decisions according to CBDT. Though the
syntax of CbDL is similar to that of PoDL, we will add a similarity-based modal
operator $\nabla$ which is of independent interest to fuzzy reasoning [4, 8]. Furthermore,
a more classical dynamic operator $\{\cdot\}$ is used for describing primitive actions in
case bases.

Thus, the alphabet of CbDL consists of those for PoDL and three additional
logical symbols $\{$, $\}$, and $\nabla$, and the following formation rule is added to those for
PoDL:

- if $\varphi \in \Phi$ and $a \in A$, then $\{a\}\varphi$ and $\nabla\varphi \in \Phi$.

To define the semantics for CbDL, we first recall the definition of a fuzzy
similarity relation. A fuzzy relation $S : X \times X \to [0,1]$ is a similarity relation if it
satisfies the following three properties, for all $x, y \in X$:

1. reflexivity: $S(x, x) = 1$,

2. symmetry: $S(x, y) = S(y, x)$, and

3. transitivity: $S(x, y) \ge \sup_{z \in X} \min(S(x, z), S(z, y))$.
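
On a finite set the three properties can be checked directly. The following sketch represents a fuzzy relation as an n-by-n list; the two example matrices are hypothetical.

```python
def is_similarity(S):
    """Check reflexivity, symmetry, and max-min transitivity of a fuzzy relation."""
    n = len(S)
    idx = range(n)
    reflexive = all(S[x][x] == 1.0 for x in idx)
    symmetric = all(S[x][y] == S[y][x] for x in idx for y in idx)
    transitive = all(
        S[x][y] >= max(min(S[x][z], S[z][y]) for z in idx)
        for x in idx for y in idx
    )
    return reflexive and symmetric and transitive


# Hypothetical relations on three elements.
S_ok = [[1.0, 0.5, 0.25], [0.5, 1.0, 0.25], [0.25, 0.25, 1.0]]
S_bad = [[1.0, 0.75, 0.0], [0.75, 1.0, 0.75], [0.0, 0.75, 1.0]]
```

`S_bad` is reflexive and symmetric but fails transitivity: elements 0 and 2 are each similar to element 1 to degree 0.75, yet their mutual similarity is 0.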

Then a CbDL model is a 5-tuple $M = (W, \tau, |\cdot|_0, S, w_0)$, where $W$, $\tau$, and $w_0$
are as above, $|\cdot|_0 : A \to 2^{W \times W}$, and $S : W \times W \to [0,1]$ is a similarity relation.
The mapping $|\cdot| : A \to (W \times W \to [0,1])$ is defined by

$$|a|(x, y) = \sup_{z \in W,\, (z,y) \in |a|_0} S(x, z).$$

That is, $|a| = S \circ |a|_0$. Then the mappings $\tau$ and $|\cdot|$ are extended to $\Phi$ and $\Pi$
as in PoDL with the extra rules for wffs of the forms $\nabla\varphi$ and $\{a\}\varphi$:

11. $\tau(w, \nabla\varphi) = \inf_{x \in W} \max(1 - S(w, x), \tau(x, \varphi))$,

12. $\tau(w, \{a\}\varphi) = \inf_{x \in W,\, (w,x) \in |a|_0} \tau(x, \varphi)$.

The definitions of logical consequence, validity, satisfiability, etc. are
analogous to those of PoDL. Sometimes, to distinguish the logical consequence
relations of PoDL and CbDL, we will add subscripts to them.

The mapping $|\cdot|_0$ models the case memory. The definition is such that
a case $(x, a, y)$ is in the memory iff $(x, y) \in |a|_0$. This requires that the actions
appearing in the memory be atomic. This requirement is not as restrictive
as it seems at first glance. Imagine that the agent has a detailed trace of the
execution of the actions in the past cases. Then for a compound action like
$\alpha;\beta$, if we know the intermediate state after the execution of $\alpha$, we can
decompose a case $(x, \alpha;\beta, y)$ into two cases $(x, \alpha, z)$ and $(z, \beta, y)$, and store the
latter two in the memory. According to the original restriction on case memory
in [7], there do not exist two cases $(p, a, x)$ and $(p, a', x')$ in the memory such
that $a \ne a'$ or $x \ne x'$, so we can also require that $|a|_0$ is a partial function in
the CbDL model. However, for generality, we do not impose the restriction on
the models.

The definition of $|\cdot|$ from $|\cdot|_0$ and $S$ makes the following a valid axiom
schema in CbDL. That is,

$$\models [a]\varphi \equiv \nabla\{a\}\varphi,$$

for $a \in A$ and $\varphi \in \Phi$.
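
The schema can be checked numerically on a finite model. In the sketch below a wff is represented only by its vector of truth values over the worlds, the crisp relation for the atomic action is a set of pairs, and the supremum (infimum) over an empty set is taken to be 0 (1); the similarity matrix, case relation, and truth values are hypothetical.

```python
def induced(S, R0):
    """|a|(x, y) = sup of S(x, z) over all z with (z, y) in |a|_0."""
    n = len(S)
    return [[max((S[x][z] for z in range(n) if (z, y) in R0), default=0.0)
             for y in range(n)] for x in range(n)]


def box(R, phi):
    """Truth values of [a]phi: inf over x of max(1 - |a|(w, x), phi(x))."""
    n = len(phi)
    return [min(max(1.0 - R[w][x], phi[x]) for x in range(n)) for w in range(n)]


def nabla(S, phi):
    """Truth values of nabla phi: the same clause with S in place of |a|."""
    return box(S, phi)


def brace(R0, phi):
    """Truth values of {a}phi: inf of phi(x) over the crisp successors of w."""
    n = len(phi)
    return [min((phi[x] for x in range(n) if (w, x) in R0), default=1.0)
            for w in range(n)]


# Hypothetical three-world model.
S = [[1.0, 0.5, 0.25], [0.5, 1.0, 0.25], [0.25, 0.25, 1.0]]
R0 = {(0, 1), (1, 2)}    # case memory for one atomic action a
phi = [0.25, 0.5, 0.75]  # truth value of a fixed wff in each world
```

Comparing `box(induced(S, R0), phi)` with `nabla(S, brace(R0, phi))` world by world gives the same values, as the schema predicts; this checks one instance and is not a proof.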

Apparently, the CbDL and PoDL models have some correspondence. In the
semantics for CbDL, we have constructed $|\cdot|$ from $|\cdot|_0$ and $S$; if we then ignore
the latter two components, a PoDL model is obtained. Since the language of
PoDL is a sublanguage of CbDL, this means that CbDL is an extension of
PoDL. Namely, if $\Sigma$ is a set of wffs of PoDL and $\varphi$ is a wff of PoDL, then
$\Sigma \models_{PoDL} \varphi$ implies $\Sigma \models_{CbDL} \varphi$.

However, unlike PoDL, we cannot find a direct correspondence between
CbDL models and the CBDT framework. This is due to the fact that in CBDT, the
set of situations $P$ and the set of consequences $X$ are not necessarily the same,
whereas in CbDL models, we model the past cases by a set of binary relations
on $W$, so $W$ plays the roles of both $P$ and $X$. To transform a CBDT framework
$(P, D, X, C, s, u)$ into a CbDL model, we can let $W = P \cup X$, and then extend
the similarity function $s$ to $W \times W$ and the utility function $u$ to $W$. However,
it is likely that $s$ is not well-defined outside $P \times P$ and $u$ is not definable on $P$.
In this case, a simple approach is just to let $u(p) = 0$ for all $p \in P$ and $s(x, y) = 0$
for $x \in X$ or $y \in X$. Then the extended $s$ is just the similarity fuzzy relation
$S$, and for the extended $u$, $u(w)$ is the truth value $\tau(w, \varphi)$ for some fixed goal
$\varphi$. The mapping $|\cdot|_0$ is derived from the case memory $C$, i.e., $(p, x) \in |a|_0$ if
$(p, a, x) \in C$. If the process of extending $s$ and $u$ does not distort the original
CBDT framework, then we have a complete specification of a CbDL model. This
reduces the process of reasoning about actions and decisions in the original CBDT
framework to the model checking problem in CbDL, as in the case of PoDL.

On the other hand, from a more practical viewpoint, we may have only a
partial description of the whole CBDT framework from the beginning. In this
case, we assume there is a subset of crisp propositional symbols $PV_0 \subseteq PV$ for
describing the case memory. Let $\Phi_0$ be the set of sentences resulting from Boolean
combinations of symbols in $PV_0$. Then in general, we have four sets of proper
axioms for the description of the framework. The first one, $\Sigma_s$, describes the
similarity function, so each sentence in it is of the form $\varphi \to (\bar r \to \nabla\psi)$ for some
$\varphi, \psi \in \Phi_0$; the second is $\Sigma_c$ for the case memory, so its sentences are of the form
$\varphi \to \{a\}\psi$ for some $\varphi, \psi \in \Phi_0$ and $a \in A$; the third is $\Sigma_u$, which specifies the
utility functions, so each sentence is of the form $\varphi \to (\bar r \to \psi)$ for some $\varphi \in \Phi_0$
and $\psi \in \Phi$; and the last is $\Sigma_0 = \{p \vee \neg p \mid p \in PV_0\}$ to enforce that each $p \in PV_0$
is two-valued. These four sets are proper axioms instead of premises because we
require that they are true not only in $w_0$ but also in all possible worlds of a
model. Let $\Gamma = \Sigma_s \cup \Sigma_c \cup \Sigma_u \cup \Sigma_0$ and suppose the agent faces a new problem
described by a set of (possibly just propositional) wffs $\Sigma$; then our problem is a
theorem-proving one. For example, if we have $\Sigma \models_\Gamma \langle\beta\rangle\varphi \to \langle\alpha\rangle\varphi$, then $\alpha$
will be a better plan than $\beta$ for the goal $\varphi$ with respect to the criterion of maximizing
$U^*$, where $\models_\Gamma$ means the logical consequence relation in a CbDL system with
$\Gamma$ as the set of proper axioms.


5  Future  Works  and  Conclusion 

We  have  outlined  two  logical  languages  for  reasoning  about  actions  and  decisions 
based  on  the  dynamic  logic  semantics  and  decision  theory  framework.  One  is 
based  on  possibilistic  decision  theory  and  the  other  on  case-based  decision  theory. 
The  language  for  PoDL  is  a  subset  of  CbDL  and  it  is  shown  that  CbDL  is  a 
conservative  extension  of  PoDL.  Both  logics  can  be  used  in  two  ways.  When  we 
have  a  complete  specification  of  the  decision  frameworks,  the  logic  can  be  used 
in  a  model  checking  way  and  if  we  have  only  a  partial  description  of  the  problem, 
then  the  logic  should  be  used  in  a  theorem  proving  way. 

As  mentioned  above,  since  theorem  proving  is  in  general  more  difficult  than 
model  checking,  the  development  of  theorem  proving  methods  for  both  logics  is 
the  first  demanding  problem  for  further  research. 

Second, while the logics developed here are mainly based on qualitative
decision theories, we would also like to develop similar logics for
quantitative decision theories. In particular, probabilistic dynamic logics (12)
should be a good starting point. However, since these logics are aimed at
reasoning about the behavior of probabilistic algorithms, they are still
two-valued, so a generalization to many-valued logics is needed. If this is
successful, we can model the classical decision framework as well as the
original quantitative criterion of maximizing Uc in CBDT.

In conclusion, the results reported in this paper are only an early stage of
a long-term goal to integrate logical reasoning and decision theories. We expect
that cross-fertilization of both fields can result from this research.
Abstract. This paper gives a preliminary consideration of the formal
specification of the language of the knowledge processing system SKAUS
(Super Knowledge Acquisition and Utilization System), which incorporates
uncertain knowledge processing and non-symbolic information processing
units. SKAUS is planned as a superset of KAUS, developed by the authors.
KAUS implements multi-layer logic (MLL for short) based on classical set
theory. SKAUS is intended to have capabilities beyond those of KAUS, such
as representing uncertain knowledge in the forms of language used in fuzzy
set theory and probability theory. In addition to this extension, we try to
incorporate matrix logic into our extension so as to process non-symbolic
information in cooperation with neural networks.


1  Introduction 

For practical AI systems, the ability to reason with uncertainty is very
important [1-3]. For example, in the application of AI technology to problem
domains such as diagnosis, control and prediction, the systems are required to
reason with uncertainty because these domains are usually ill-defined, so
neither the problems nor the solving methods can be well described.
Another example is seen in intelligent information retrieval systems. Users
sometimes pose ambiguous queries to the retrieval systems. In this case, the
systems have to resolve the ambiguities in the queries so that they
can retrieve the data the users actually want.

The aim of this paper is to give a preliminary consideration of extending
MLL [5,6] so that we can handle reasoning with uncertainty by incorporating
fuzzy sets, originated by Zadeh [4], into the extended MLL. In addition to this
extension, we try to incorporate matrix logic into our system so as to process
non-symbolic information in cooperation with neural networks. We plan to extend
KAUS [10] into SKAUS (super knowledge acquisition and utilization system), which
enables soft computing with these extensions. We briefly introduce MLL and
KAUS here. More details of MLL are discussed in the literature [5,6].

In 1985 we formalized multi-layer logic (MLL for short), which is an
extended version of first order logic. MLL was formulated as a formal system
for constructing general purpose knowledge processing systems. Though ordinary
first order logic does not assume any structural constraints on variable and
constant terms, these terms may be structured in a set hierarchy in MLL. The
sets treated in MLL are crisp sets based on axiomatic set theory, specifically,
on admissible sets [7].

Adopting MLL as the theoretical basis, we have developed KAUS (knowledge
acquisition and utilization system), which is used as a tool for building
knowledge-based systems. Until now we have applied KAUS to various kinds of
model building and evaluation by computer [8,9].

Rules described in the KAUS language are not restricted to Horn clauses but
may be arbitrary AND-OR clauses. Variables appearing in KAUS clauses may be
universally or existentially quantified with type restrictions. For example,
(1) Every boy likes a girl, (2) The age of John is 24 or 25, (3) If a person X
does not have his own car, then X is not a car driver or X is a paper-driver,
and (4) If a group X of students is interested in computer science, then each
member Y of X learns a programming language, are respectively expressed in KAUS
as follows.

(1) [AX/boy][EY/girl](like X Y).

(2) (| (age John 24) (age John 25)).

(3) [AX/person][ACar/car]
    (| (| ~(carDriver X) (paper-driver X)) (have X Car)).

(4) [AX/*student][AY/X][EL/programmingLanguage]
    (| (learn Y L) ~(interestedGroup X computerScience)).

As seen above, we represent a clause $A \rightarrow B$ by $\neg A \vee B$.

We have implemented the inference rules given in [5,6] together with a
unification algorithm based on the resolution principle; when the two variables
to be unified are typed variables (as in the examples above), type unification
is also performed [10]. With regard to uncertain reasoning, disjunctive logic
programming [11], though in a limited way, and the building of models of
possible worlds using the world constructor of KAUS are possible.

2  Incorporating  Fuzzy  Set  Theory  into  MLL 

The most primitive concept in set theory, and so in MLL, is the membership
relation that an element $x$ belongs to a set $A$. In classical set theory we
write this relation as $x \in A$. The truth value of $x \in A$ is often
described using its characteristic function $\phi_A(x)$ such that
$\phi_A(x) = 1$ if and only if $x \in A$, and $\phi_A(x) = 0$ if and only if
$x \notin A$. For example, 'John is a student' will be described as
$John \in student$ and $\phi_{student}(John) = 1$. How about 'John is a tall
student'? We might write it as $John \in tall\_student$ and
$\phi_{tall\_student}(John) = 1$. However, if we want to classify Jim as
$Jim \in tall\_student$ from the fact 'Jim is a student and his height is
180cm', we have to clarify the concept 'tall'. Since 'tall' is a vague concept,
we cannot define exactly what is meant by 'tall'. In the approach of classical
logic, and so of MLL, we can only heuristically or subjectively define
tallness in a logical form such as

\[ (\forall x \in student)[height(x, h) \wedge h > 175 \rightarrow tall(x)] \]

From this we can say that the set of all tall students (denoted by
tall_student) is the subset of the set student that satisfies the above
relation. This is an intensional definition of a set. Another definition is
possible from the extensionality of a set. In the extensional definition of a
set, we explicitly enumerate all elements of the set. For example,
$tall\_student = \{John, Jim, \ldots\}$.

The enumerated elements are thought of as having common properties and
attributes. There is no ambiguity in the intensional and extensional
definitions of sets in classical set theory. Furthermore, it seems that
classical set theory is enough for describing all things, including ambiguous
and vague concepts, in such a way. All descriptions in the classical theory can
be evaluated as exactly true or false. So one could say that MLL is enough and
that there is no problem in MLL. However, if we describe ambiguous and vague
concepts in MLL in cooperation with fuzzy set theory, MLL will become a more
practical theory, because fuzzy set theory is a very practical and intuitive
theory for real applications. In the following we describe the extended MLL
from the point of view described above.

2.1  Extending  Set  Relationships  in  MLL 

Our criterion for incorporating fuzzy sets into MLL is that the set
relationships defined in MLL are the special cases in which the fuzzy set
membership functions are restricted to the extreme points {0,1} of [0,1].
Because of this we adopt $\alpha$-level sets to define set relationships. The
$\alpha$-level set $A_\alpha$ of a fuzzy set $A$ is defined as follows [12,13].

Definition 1. ($\alpha$-cuts) Given $\alpha \in [0,1]$ and the membership
function $\mu_A$ of a fuzzy set $A$, we define $\alpha$-level sets $A_\alpha$
of $A$ from the following $\alpha$-cuts (a) or strong $\alpha$-cuts (b).

\[
\begin{array}{ll}
\text{(a)} & A_\alpha = \{x \in U \mid \mu_A(x) \ge \alpha\}\\
\text{(b)} & A_\alpha = \{x \in U \mid \mu_A(x) > \alpha\}
\end{array}
\qquad (1)
\]

We can reconstruct $\mu_A$ from the family of $\alpha$-level sets $A_\alpha$
of $A$:

\[ \mu_A(x) = \sup\{\alpha \mid x \in A_\alpha\} \qquad (2) \]

An $\alpha$-level set is a crisp set, and the $\in$-relation used in (2) is the
ordinary membership relation. Since, as described earlier in this chapter, the
membership relation is the most fundamental relation in set theory, we define
a similar $\in$-relation for fuzzy sets.
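The two cuts of Definition 1 and the reconstruction formula (2) can be illustrated on a finite universe. The following is only a sketch: the function names `alpha_cut` and `reconstruct` and the grades assigned to "tall" are assumptions made for illustration, and the supremum in (2) is taken over a finite set of levels.

```python
def alpha_cut(mu, alpha, strong=False):
    """Return the alpha-cut (or strong alpha-cut) of a fuzzy set.

    mu maps each element of a finite universe to its membership grade.
    """
    if strong:
        return {x for x, grade in mu.items() if grade > alpha}
    return {x for x, grade in mu.items() if grade >= alpha}


def reconstruct(universe, cuts):
    """Rebuild grades as sup{alpha | x in A_alpha} over the given levels."""
    return {
        x: max((a for a, cut in cuts.items() if x in cut), default=0.0)
        for x in universe
    }


# Hypothetical grades for the fuzzy set "tall".
tall = {"john": 0.9, "jim": 0.6, "ken": 0.2}

cut = alpha_cut(tall, 0.6)                      # alpha-cut at 0.6
strong_cut = alpha_cut(tall, 0.6, strong=True)  # strong alpha-cut at 0.6

# Using every grade that occurs as a level recovers the original function.
levels = set(tall.values())
cuts = {a: alpha_cut(tall, a) for a in levels}
rebuilt = reconstruct(tall, cuts)
```

Here the boundary element "jim" belongs to the cut at 0.6 but not to the strong cut, which is the only difference between (a) and (b).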
Definition 2. ($\in_\alpha$-relation) We write $x \in_\alpha A$ iff $x$
belongs to the $\alpha$-level set $A_\alpha$ of a fuzzy set $A$, and
$x \notin_\alpha A$ iff $x$ does not belong to the $\alpha$-level set
$A_\alpha$ of $A$.

\[
\begin{cases}
x \in_\alpha A & \text{iff } x \in A_\alpha\\
x \notin_\alpha A & \text{iff } x \notin A_\alpha
\end{cases}
\qquad (3)
\]

The inclusion relation between fuzzy sets is defined using $\alpha$-cuts of
fuzzy sets.

Definition 3. (inclusion relation)

\[ A_\alpha \subseteq B_\beta : (\forall x)[x \in A_\alpha \rightarrow x \in B_\beta] \wedge \beta \le \alpha \qquad (4) \]

Union, intersection, and complementation operators are defined as follows.

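Definition 4. (union and intersection operators) Given $X$ whose elements are
fuzzy sets and $\alpha, \beta \in [0,1]$, we define union and intersection
operators as follows.

\[ \text{union } \cup X : [x \in_\alpha \cup X \leftrightarrow (\exists Z)(\exists \beta)[\beta \ge \alpha \wedge x \in_\beta Z \wedge Z \in X]] \qquad (5) \]

\[ \text{intersection } \cap X : [x \in_\alpha \cap X \leftrightarrow (\forall Z)(\exists \beta)[\beta \le \alpha \wedge x \in_\beta Z \wedge Z \in X]] \qquad (6) \]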

For example, if $X = \{A, B\}$, then $\cup X = A \cup B$. The membership
function of $A \cup B$ is defined from the $\alpha$-cuts given in (1) and (2).
Typically, the membership function of $A \cup B$ is an s-norm (t-conorm) such
that $\mu_{A \cup B}(x) = \oplus(\mu_A(x), \mu_B(x))$, where each membership
function of $A$ and $B$ is defined from some $\beta_A$-cuts and
$\beta_B$-cuts such that $\beta_A \le \alpha$ and $\beta_B \le \alpha$. The
membership function of $A \cap B$ is given by the t-norm such that
$\mu_{A \cap B}(x) = \otimes(\mu_A(x), \mu_B(x))$.
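As a small illustration of how an s-norm and a t-norm yield the membership functions of a union and an intersection, the sketch below uses two common pairs, max/min and algebraic sum/product. The helper names and the grades are assumptions made for this example; the text does not fix a particular norm at this point.

```python
def t_min(a, b):
    """Minimum t-norm."""
    return min(a, b)


def s_max(a, b):
    """Maximum s-norm (t-conorm)."""
    return max(a, b)


def t_prod(a, b):
    """Algebraic product t-norm."""
    return a * b


def s_sum(a, b):
    """Algebraic sum s-norm: a + b - a*b."""
    return a + b - a * b


def combine(mu_a, mu_b, norm):
    """Apply a norm pointwise; elements missing from a set have grade 0."""
    keys = set(mu_a) | set(mu_b)
    return {x: norm(mu_a.get(x, 0.0), mu_b.get(x, 0.0)) for x in keys}


def alpha_cut(mu, alpha):
    """Crisp alpha-cut of a fuzzy set given as {element: grade}."""
    return {x for x, grade in mu.items() if grade >= alpha}


# Hypothetical fuzzy sets A and B over the same universe.
A = {"john": 0.5, "jim": 0.25}
B = {"john": 0.75, "jim": 0.5}

union_max = combine(A, B, s_max)
inter_min = combine(A, B, t_min)
union_sum = combine(A, B, s_sum)
inter_prod = combine(A, B, t_prod)

# With max as the s-norm, cutting the union equals uniting the cuts.
cuts_commute = alpha_cut(union_max, 0.5) == alpha_cut(A, 0.5) | alpha_cut(B, 0.5)
```

The last check holds for the max/min pair at every level; for other norm pairs the cut of the combined set generally differs from the combination of the cuts.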

Definition 5. (complementation) Given a fuzzy set $A$ and $\alpha \in [0,1]$,
we define the complement set $\bar{A}$ of $A$ as follows.

\[ \text{complementation } \bar{A} : [x \in_\alpha \bar{A} \leftrightarrow x \notin_{1-\alpha} A] \qquad (7) \]
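One way to read Definition 5 is through the usual complement, whose grade is one minus the grade in $A$. This reading is an assumption (the text does not state the membership function of the complement), and under it the equivalence is exact when the level set on the right-hand side is taken as a strong cut. The sketch below checks this on a small, hypothetical fuzzy set.

```python
def alpha_cut(mu, alpha, strong=False):
    """Crisp (strong) alpha-cut of a fuzzy set given as {element: grade}."""
    if strong:
        return {x for x, grade in mu.items() if grade > alpha}
    return {x for x, grade in mu.items() if grade >= alpha}


def complement(mu):
    """Standard fuzzy complement: grade 1 - mu(x) (assumed, not from the text)."""
    return {x: 1.0 - grade for x, grade in mu.items()}


# Hypothetical fuzzy set; grades chosen to be exact in binary floating point.
A = {"john": 0.75, "jim": 0.5, "ken": 0.25}
not_A = complement(A)
universe = set(A)

# x is in the alpha-cut of the complement  iff  x is outside the
# strong (1 - alpha)-cut of A.
agreement = all(
    alpha_cut(not_A, alpha) == universe - alpha_cut(A, 1.0 - alpha, strong=True)
    for alpha in (0.25, 0.5, 0.75)
)
```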

Definition 6. (powerset) Given a fuzzy set $X$, we define the powerset of $X$
as follows.

\[ \text{powerset } {*X} : [Y \in {*X} \leftrightarrow Y \subseteq X] \qquad (8) \]

These  definitions  (1)  -  (8)  are  used  for  defining  the  inference  rules  of  the  extended 
MLL. 


2.2  Extending  Inference  Rules  in  MLL 

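In a fuzzy system, a typical pattern of the fuzzy inference rules is expressed
as

\[
\frac{\begin{array}{l}
\text{if } x \text{ is } A, \text{ then } x \text{ is } B\\
a \text{ is } A^*
\end{array}}{a \text{ is } B^*}
\qquad (9)
\]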

where $x$ is a variable and $a$ is a constant. $A$, $A^*$, $B$ and $B^*$
represent fuzzy predicates (fuzzy sets) [14]. For example, from (9), if 'if x
is tall, then x is heavy' and 'John is very tall' are given as premises, we can
conclude that 'John is very heavy'. Zadeh defined such a generalized modus
ponens rule [14]. However, it should be noted here that there is a difference
between 'if x is A, x is B' and 'the more x is A, the more x is B', and this
should be clarified in the inference system [14]. We agree with this, and our
inference system will reflect it. As for fuzzy unification, unification rules
between fuzzy predicates have been discussed and formulated [15-18]. Some real
tools for developing fuzzy systems have also been developed [19,20].

In this section we attempt to extend MLL so that we can handle fuzzy
inference such as (9) in the extended MLL. In MLL, predicates are assumed to be
2-valued predicates. The variables may be quantified like $(\forall x/X)$,
indicating $(\forall x \in X)$, and $(\exists x/X)$, indicating
$(\exists x \in X)$. The values of variables may be sets, but they are assumed
to be crisp sets. Constant terms may also be sets, but they too are assumed to
be crisp sets. The inference rules of MLL are formalized in [5,6]. We show two
of these here.

\[ \{(\forall x/X)P[x],\ a \in X\} \vdash P[a]. \qquad (10) \]

\[ \{(\forall x/X)P[x],\ (\forall y/Y)[P[y] \rightarrow Q[y]],\ Y \supseteq X\} \vdash (\forall x/X)Q[x]. \qquad (11) \]
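Over finite crisp sets, rules (10) and (11) can be checked directly. The sketch below is a model-checking illustration rather than a proof procedure: the function names, the sets and the predicates are hypothetical, and each function returns `None` when the premises of the rule do not hold.

```python
def forall(domain, pred):
    """(forall x/X) P[x] over a finite crisp set X."""
    return all(pred(x) for x in domain)


def instantiate(domain, pred, a):
    """Rule (10): from (forall x/X) P[x] and a in X, conclude P[a]."""
    if a in domain and forall(domain, pred):
        return pred(a)
    return None  # premises not satisfied, rule not applicable


def typed_modus_ponens(X, Y, P, Q):
    """Rule (11): from (forall x/X) P[x], (forall y/Y)[P[y] -> Q[y]]
    and Y being a superset of X, conclude (forall x/X) Q[x]."""
    premises = (
        forall(X, P)
        and forall(Y, lambda y: (not P(y)) or Q(y))
        and Y >= X
    )
    if not premises:
        return None
    return forall(X, Q)


# Hypothetical typed domains and predicates.
PERSON = {"john", "jim", "mary"}
BOY = {"john", "jim"}
MORTAL = {"john", "jim", "mary"}


def is_person(x):
    return x in PERSON


def is_mortal(x):
    return x in MORTAL


conclusion_10 = instantiate(BOY, is_person, "john")
conclusion_11 = typed_modus_ponens(BOY, PERSON, is_person, is_mortal)
not_applicable = typed_modus_ponens(PERSON, BOY, is_person, is_mortal)
```

In the last call the type condition fails because BOY does not contain PERSON, so the rule is not applied even though the other premises hold.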

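(11) is the modus ponens rule in MLL. To incorporate fuzzy inference rules
into MLL, we need additional inference rules. Corresponding to (10) and (11),
we need

\[ \{(\forall x/X_\alpha)P[x],\ a \in_\alpha X\} \vdash P[a]. \qquad (12) \]

\[ \{(\forall x/X_\alpha)P[x],\ (\forall y/Y_\beta)[P[y] \rightarrow Q[y]],\ Y_\beta \supseteq X_\alpha\} \vdash (\forall x/X_\gamma)Q[x]. \qquad (13) \]

where $\gamma$ in the conclusion of (13) is conditioned to be $\gamma =$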

A unification rule for fuzzy constants is also required in the extended MLL.
For example, unification between $height(x, tall)$ and $height(x, very.tall)$
should be possible. Unification between a literal value and a numerical value
of a fuzzy constant is also considered.

In the following, we illustrate a simple example of inference involving fuzzy
predicates. We transform the following (14) step by step into the form used in
the extended MLL.

\[
\frac{\begin{array}{l}
(\forall x)[person(x) \wedge tall(x) \rightarrow heavy(x)]\\
(\forall x)[boy(x) \rightarrow person(x)]\\
boy(john),\ very.tall(john)
\end{array}}{very.heavy(john)}
\ (very)
\qquad (14)
\]


In (14), 'very' is attached to the horizontal line to indicate that the
inference is performed with fuzzy unification between 'tall' and 'very.tall'.
It shows that the grade of tallness should be transmitted to the grade of
heaviness in the conclusion.

We transform (14) into (15) using the axiom of intensionality of a set and
$\alpha$-cuts of a fuzzy predicate (fuzzy set). We first rewrite $person(x)$
as $x \in PERSON$, $boy(x)$ as $x \in BOY$, and $boy(john)$ as
$john \in BOY$. Next we rewrite each fuzzy predicate. For example, $tall(x)$
becomes $tall_\alpha(x)$, indicating that 'tall' is a fuzzy predicate and that
its fuzziness is determined by the fuzzy membership function induced from the
$\alpha$-cuts $TALL_\alpha$ of the fuzzy set $TALL$. Finally we rewrite
$very.tall(john)$ as $tall_{very}(john)$. The fuzziness $\gamma$ of the
conclusion is calculated by the composition rule attached to the horizontal
line:

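\[
\frac{\begin{array}{l}
(\forall x)[x \in PERSON \wedge tall_\alpha(x) \rightarrow heavy_\beta(x)]\\
(\forall x)[x \in BOY \rightarrow x \in PERSON]\\
john \in BOY,\ tall_{very}(john)
\end{array}}{heavy_\gamma(john)}
\ (\gamma = \oplus\{\otimes\{\alpha, very\}, \beta\})
\qquad (15)
\]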

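We next transform (15) into (16) using the notations given in (17).

\[
\frac{\begin{array}{l}
(\forall x/PERSON)[tall_\alpha(x) \rightarrow heavy_\beta(x)]\\
BOY \subseteq PERSON\\
john \in BOY,\ tall_{very}(john)
\end{array}}{heavy_\gamma(john)}
\ (\gamma = \oplus\{\otimes\{\alpha, very\}, \beta\})
\qquad (16)
\]

\[
\begin{array}{l}
(\forall x/X)p(x) \equiv (\forall x)[x \in X \rightarrow p(x)]\\
(\exists x/X)p(x) \equiv (\exists x)[x \in X \wedge p(x)]
\end{array}
\qquad (17)
\]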


In (16), the fuzzy predicates are left as they are, but the second premise in
(15) is replaced by the set inclusion relation. As in (15), the fuzziness
$\gamma$ of the conclusion is calculated using the composition rule attached
to the horizontal line.

Finally we transform (16) into the complete extended MLL form. Until now
we have represented a fuzzy predicate in the form $p_\alpha(x)$. Here we
introduce a new notation $p{:}\alpha$ in order to declare that $p$ is a fuzzy
predicate or a fuzzy term having $\alpha$ as its fuzzy parameter. If we write
$p{:}\alpha(x)$, we assume that $p{:}\alpha$ is a fuzzy predicate identifier.
Otherwise, namely if we write simply $p{:}\alpha$, we assume it is a fuzzy
term. This results in an extension of the well formed formulas of MLL.
$\alpha$ may be a variable or a constant given either literally or
numerically. We note here that the truth value of $p{:}\alpha(x)$ is
determined as follows.


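\[
p{:}\alpha(x) =
\begin{cases}
\text{true} & \text{if } x \in P_\alpha \cap X, \text{ where } P_\alpha \text{ is an } \alpha\text{-cut of } P\\
\text{false} & \text{otherwise}
\end{cases}
\qquad (18)
\]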

where $X$ denotes the domain of $x$. Then for the first premise of (16),
$tall_\alpha(x)$ is written as $tall{:}\alpha(x)$ and $heavy_\beta(x)$ as
$heavy{:}\beta(x)$. Using this notation, (16) can be rewritten as follows.


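\[
\frac{\begin{array}{l}
(\forall x/PERSON)[tall{:}\alpha(x) \rightarrow heavy{:}\beta(x)]\\
BOY \subseteq PERSON\\
john \in BOY,\ tall{:}very(john)
\end{array}}{heavy{:}\gamma(john)}
\ (\gamma = \oplus\{\otimes\{\alpha, very\}, \beta\})
\qquad (19)
\]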


As a result, if $\alpha = \beta$ in (19), $\otimes = \min$ and
$\oplus = \max$, we can easily conclude $\gamma = very$ in
$heavy{:}\gamma(john)$, which means 'John is very heavy'. The extended
MLL intends to use such a notation as (19).
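The truth value of a fuzzy predicate written with the colon notation, which is true exactly when the argument lies in the alpha-cut of the underlying fuzzy set and in the domain of the variable, can be sketched as follows. The grades, the numeric value chosen for the literal parameter 'very', and the function name are all assumptions made for illustration.

```python
def fuzzy_predicate(mu, alpha, domain):
    """Build a 2-valued test for the fuzzy predicate p:alpha over a typed domain.

    mu     -- membership grades of the fuzzy set P, as {element: grade}
    alpha  -- numeric fuzzy parameter
    domain -- crisp set X that types the variable
    """
    cut = {x for x, grade in mu.items() if grade >= alpha}
    allowed = cut & set(domain)
    return lambda x: x in allowed


# Hypothetical data: grades of tallness and a numeric reading of 'very'.
TALL = {"john": 0.9, "jim": 0.6, "mary": 0.95}
PERSON = {"john", "jim"}
VERY = 0.8

tall_very = fuzzy_predicate(TALL, VERY, PERSON)

john_is_very_tall = tall_very("john")  # in the cut and in the domain
jim_is_very_tall = tall_very("jim")    # in the domain, below the cut
mary_is_very_tall = tall_very("mary")  # in the cut, outside the domain
```

The predicate stays 2-valued: the fuzziness is carried entirely by the parameter, which is the design that lets the crisp MLL machinery be reused.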
2.3  Probabilistic  Reasoning

We can apply probability theory to reasoning with uncertainty. However, the
mixed use of probability theory and fuzzy set theory is dangerous. There are
critical differences between them. Fuzzy set theory deals with vague and
imprecise notions and defines partial degrees of truth. Probability theory, on
the other hand, deals with crisp notions and does not define partial degrees
of truth but the degree of belief in truth [21]. Consider the following
assertion.

If x is tall (A) and x is heavy (B), then x is strong (C).  (20)

In fuzzy set theory, if A and B are partially satisfied with some evidence A*
and B*, then the graded truth value of the conclusion C* can be assigned with
the composition rules between the membership functions of A* and B*. In
probability theory, (20) can be rewritten using probabilities such as

If A is p-probable and B is q-probable, then C is r-probable.  (21)

where p, q and r are the probabilities of A, B and C respectively. These
probabilities define the degrees of belief in A, B and C. If some evidence
p*(A) and q*(B) is given, then we can conclude r*(C) by some probabilistic
composition rules. For example, plausible and possibility reasoning are
formulated by using belief functions in Dempster-Shafer's evidence theory
[1,22] and the probabilistic measures described in [23]. The Bayes approach
[22], which uses conditional probabilities, is a strictly probabilistic
approach. Baldwin et al. have relaxed probability measures and formulated
support pairs of necessity measures and possibility measures [19] as
probabilistic measures for formulas. Dubois et al. have also relaxed
probability measures by using mass functions as probabilities [18]. In any
case, we can apply a probabilistic approach to reasoning with uncertainty
under certain restrictions. The main problem is which norms of uncertainty and
which composition rules are suitable for reasoning with uncertainty in real
applications. A real implementation of SKAUS that can reason with uncertainty
should incorporate a mechanism for selecting appropriate uncertainty measures
and composition rules.


3  Incorporating  Matrix  Logic  into  MLL 

In August Stern's matrix logic [25], logical truth values and connectives are
represented by logic vectors and matrix operators. In this formulation, not
only ordinary 2-valued logic but also many-valued logics, including fuzzy
logic, modal logic and probabilistic logic, are treated uniformly in the same
framework. Matrix logic is closely related to neural network computing because
of its algebraic treatment of objects. By incorporating matrix logic into
SKAUS as a meta-predicate, we can expect to realize the fusion of symbolic
processing and non-symbolic processing.
3.1  Matrix  Logic

In this section, we briefly introduce August Stern's matrix logic. First we
give some notations used in matrix logic.

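\[
\begin{array}{ll}
\text{row logic vector:} & \langle p| = (\bar{p}, p) \quad \text{(called bra vector)}\\[4pt]
\text{column logic vector:} & |q\rangle = \begin{pmatrix} \bar{q} \\ q \end{pmatrix} \quad \text{(called ket vector)}
\end{array}
\qquad (22)
\]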

where  p  and  q  in  the  left  side  of  (22)  are  atomic  formulas,  whereas  p  and  q  in  the 
right  side  represent  truth  values.  In  2-valued  logic,  if  p  is  true,  then  <  p\  =  (0, 1). 
The  inner  product  and  outer  product  of  logic  vectors  are  written  as  follows. 


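\[
\begin{array}{ll}
\text{inner product:} & \langle x|y\rangle = (\bar{x}, x)\begin{pmatrix} \bar{y} \\ y \end{pmatrix} = \bar{x}\bar{y} + xy\\[6pt]
\text{outer product:} & |x\rangle\langle y| = \begin{pmatrix} \bar{x}\bar{y} & \bar{x}y \\ x\bar{y} & xy \end{pmatrix} (\equiv \Omega), \quad \langle \mathbf{1}|\Omega|\mathbf{1}\rangle = \bar{x}\bar{y} + \bar{x}y + x\bar{y} + xy = 1
\end{array}
\qquad (23)
\]

$\Omega$ is called the universal operator. A matrix operator $L$ is written as
follows.

\[ \langle x|L|y\rangle = \langle x| \begin{pmatrix} L_{11} & L_{12} \\ L_{21} & L_{22} \end{pmatrix} |y\rangle = x_1 L_{11} y_1 + x_1 L_{12} y_2 + x_2 L_{21} y_1 + x_2 L_{22} y_2 \qquad (24) \]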
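The vector notation can be made concrete with a few lines of code. The sketch below assumes that a truth value p in [0,1] is encoded as the pair (1 - p, p), which agrees with the statement that a true p gives the bra vector (0, 1); the function names are hypothetical.

```python
def vec(p):
    """Components (1 - p, p) shared by the bra and the ket of a truth value p."""
    return (1 - p, p)


def inner(x, y):
    """Inner product <x|y> of two logic vectors."""
    return sum(a * b for a, b in zip(vec(x), vec(y)))


def outer(x, y):
    """Outer product |x><y| as a 2x2 matrix."""
    return [[a * b for b in vec(y)] for a in vec(x)]


def bilinear(x, L, y):
    """Value <x|L|y> of a 2x2 matrix operator L, as in (24)."""
    bx, ky = vec(x), vec(y)
    return sum(bx[i] * L[i][j] * ky[j] for i in range(2) for j in range(2))


ONES = [[1, 1], [1, 1]]  # operator whose components are all one
AND = [[0, 0], [0, 1]]   # conjunction in the basis (false, true)

same = inner(1, 1)       # both true
different = inner(1, 0)  # one true, one false

# The four entries of an outer product always sum to one.
total = sum(sum(row) for row in outer(0.25, 0.5))

conj = bilinear(1, AND, 1)
norm = bilinear(0.25, ONES, 0.5)
```

Because the two components of every logic vector sum to one, sandwiching the all-ones operator between any bra and ket gives one, which is the normalization stated for the outer product.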


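Note that $\overline{\langle x|L|y\rangle} = \langle x|\bar{L}|y\rangle$,
$\bar{x} = 1 - x$, $\overline{|x\rangle} = \mathbf{1} - |x\rangle = \neg|x\rangle$,
and $\bar{L} = \mathbf{1} - L \ne \neg L$, where $\mathbf{1}$ indicates a
special vector or a special operator, respectively, whose components are all
one. Another point is

\[ \langle x|\wedge|y\rangle = \langle x|1\rangle\langle 1|y\rangle = \langle 1|x\rangle\langle y|1\rangle = \langle 1|L(|x\rangle\langle y|)|1\rangle \qquad (25) \]

where $|1\rangle$ is the column vector of $\langle 1| = (0,1)$, and
$L(|x\rangle\langle y|)$ is a variable logic operator, taking different
shapes depending on the values obtained by the vectors $|x\rangle$ and
$\langle y|$.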

Some  examples  of  matrix  operators  are  shown  below. 

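\[
\wedge = \begin{pmatrix} 0 & 0 \\ 0 & 1 \end{pmatrix},\
\vee = \begin{pmatrix} 0 & 1 \\ 1 & 1 \end{pmatrix},\
\rightarrow = \begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix},\
\neg = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix},\
| = \begin{pmatrix} 1 & 1 \\ 1 & 0 \end{pmatrix},\
\dagger = \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix}
\qquad (26)
\]

$|$ and $\dagger$ are the nand and nor operators respectively. We see
$\rightarrow = \neg\vee$ from (26). In matrix logic, the modus ponens rule is
represented like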

{xA{x-^y))-~^y=<  a;|l  ><  x\  \y  ><  Ojy  >  =  1  (27) 

Note  here  that  (27)  is  obtained  using  (25)  and  <0|i/>=-^l2/>. 
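The connective operators of (26) can be checked mechanically. In the sketch below the matrices are the ordinary 2-valued truth tables written in the basis (false, true); this layout is an assumption chosen to agree with the encoding of a true value as (0, 1), and the helper names are hypothetical. The code verifies each truth table and the identity that implication is the matrix product of negation and disjunction.

```python
AND = [[0, 0], [0, 1]]
OR = [[0, 1], [1, 1]]
IMPLY = [[1, 1], [0, 1]]
NOT = [[0, 1], [1, 0]]
NAND = [[1, 1], [1, 0]]
NOR = [[1, 0], [0, 0]]


def vec(p):
    """Logic vector (1 - p, p) of a truth value p."""
    return (1 - p, p)


def evaluate(x, L, y):
    """Value <x|L|y> of a connective L applied to truth values x and y."""
    bx, ky = vec(x), vec(y)
    return sum(bx[i] * L[i][j] * ky[j] for i in range(2) for j in range(2))


def matmul(A, B):
    """Product of two 2x2 matrices."""
    return [
        [sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
        for i in range(2)
    ]


# Expected 2-valued behaviour of each binary connective.
EXPECTED = {
    "and": (AND, lambda x, y: x & y),
    "or": (OR, lambda x, y: x | y),
    "imply": (IMPLY, lambda x, y: (1 - x) | y),
    "nand": (NAND, lambda x, y: 1 - (x & y)),
    "nor": (NOR, lambda x, y: 1 - (x | y)),
}

tables_ok = all(
    evaluate(x, L, y) == f(x, y)
    for L, f in EXPECTED.values()
    for x in (0, 1)
    for y in (0, 1)
)

imply_is_not_or = matmul(NOT, OR) == IMPLY
```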


3.2  Neural  Networks 

In this section we consider the problem of adjusting membership functions of
fuzzy sets using neural network techniques. As is well known, by combining
fuzzy systems with neural networks we can add a learning ability to fuzzy
systems. Conversely, the difficulty of giving a plain explanation of neural
computations is overcome [25].

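We consider here the fuzzy inference scheme given in (9). We encode (9) in a
neural network using matrix logic as follows.

\[
\begin{array}{llll}
 & \text{input from neuron } X & \text{hidden layer} & \text{output to neuron } Y\\
\text{(r):} & \langle A(x)| & & |B(x)\rangle\\
\text{(a):} & \langle A^*(a)| & & \\
\hline
\text{(b):} & & & |B^*(a)\rangle
\end{array}
\qquad (28)
\]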

The hidden layer receives input (a) and computes (b) by applying (r). In each
computing cycle of (b), the output function of the hidden neuron adjusts the
membership functions used in (r) by applying a fuzzy version of (27), such
that the following equation is satisfied.

(l-a*,a*)(j)(l-a,a)(j5)  (^-^)  =^*  (29) 

We note here that we used the algebraic product as the t-norm and the
algebraic sum as the s-norm for calculating $\beta^*$ in (29). We also assumed
that fuzzification of the input at (a) and defuzzification of the output at (b)
in (28) are performed as preprocessing and postprocessing respectively.
Furthermore, we note that in a fuzzy version of (27) under (29),
$\langle \beta^*|\wedge|\overline{\beta^*}\rangle = \langle \beta^*|\wedge\neg|\beta^*\rangle = \beta^*(1 - \beta^*) \ne 0$
in general.
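The closing remark, that a fuzzy value conjoined with its own negation does not vanish in general, can be checked numerically. The sketch assumes the encoding of a grade as the logic vector (1 - grade, grade) and the usual matrices for conjunction and negation; the function name is hypothetical.

```python
AND = [[0, 0], [0, 1]]
NOT = [[0, 1], [1, 0]]


def contradiction_degree(beta):
    """Value of <beta| AND NOT |beta> for a grade beta in [0, 1]."""
    bra = (1 - beta, beta)
    ket = (1 - beta, beta)
    # Applying NOT to the ket swaps its two components.
    negated = tuple(sum(NOT[i][k] * ket[k] for k in range(2)) for i in range(2))
    return sum(bra[i] * AND[i][j] * negated[j] for i in range(2) for j in range(2))


crisp_true = contradiction_degree(1.0)   # classical case
crisp_false = contradiction_degree(0.0)  # classical case
half = contradiction_degree(0.5)         # largest violation
```

Only the crisp grades 0 and 1 give zero, so the law of contradiction holds in the 2-valued case and fails for every intermediate grade.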


4  Conclusion 

We have applied the idea of $\alpha$-cuts of fuzzy sets to formulate the
extended MLL, which can perform reasoning with uncertainty. We have also
considered matrix logic as a tool for the fusion of symbolic computation and
non-symbolic (numerical) computation in relation to neural networks. SKAUS,
which is based on the extended MLL and has a meta-predicate for matrix logic
computation, is expected to widen the range of applications in real
environments.
Abstract. In this paper, we introduce a novel perspective on Fuzzy
Logic by referring to the theories of Chu Space and Information Flow,
i.e., Channel Theory, which results in a deep insight into the interaction
and coordination of agents with environments. First, a constraint-oriented
interpretation of fuzzy sets is introduced, yielding the notion of
Constraint-Interval Fuzzy Set (CoIFS). Then the above theories are
introduced, which elucidate the basic structures of fuzzy inference as
constraint propagation, yielding the spaces of Coordination and Interaction.
The structure of the Information Transmission Channel of constraint
propagation is also clarified, together with its relevance to the results of
the theory of Chu Space. All the results can be used to elucidate the
basic structure of "interfacing (interfacing media)" between agents and
environments.


1  Introduction 

Various artificial systems are involved in complicated ways with human beings,
societal systems, and environments. It becomes more and more important to
manage these complexities and to improve the quality of the interactions among
them. We focus on the boundaries, media, and mechanisms of interactive
systems concerned with the structural coupling between subjects and
environments. Here we introduce constraint-oriented perspectives on these
interactions and on the symbolization of continuously valued quantities. For
this symbolization, we use the notion of Constraint-Interval Fuzzy Set (CoIFS),
which we have introduced to tie together Fuzzy Logic and the traditional
two-valued logic that underlies traditional AI. Namely, we encode the action
space of the subject and the sensation signal space from the environment into
constraint-interval fuzzy sets, and we also introduce spaces for representing
the background structures of the situation-action relation, where the
correspondence of "constraint-intervals" with "constraint-levels" between
CoIFSs is treated.

Moreover, by introducing the theories of Chu Space and Information Flow
to Fuzzy Logic, we provide two novel perspectives on the theory of CoIFSs.
Introducing Chu Space provides us not only with a formal treatment of
constraint propagation among fuzzy sets but also with a new relation among
fuzzy sets. Introducing Information Flow gives a new perspective on
constraint propagation as a flow of information and on systems of fuzzy sets
as distributed and decentralized systems.

2  Fuzzy  Logic-Based  Coding  of  Constraints 

Based on the constraint-oriented perspectives on problem solving, a
constraint-interval fuzzy set (CoIFS) is given as an ordered collection of
intervals (constraint-intervals). We have also proposed fuzzy logic-based
operations and defuzzification for the CoIFS and elucidated the specific
properties of CoIFS, i.e.,

-  Symbolic  (hard)  inference  can  be  related  to  fuzzy  (soft)  inference  on  the 
same  ground  [1]. 

-  CoIFS  involves  chaotic  characteristics  in  the  interaction  of  constraint
propagation  via  symbolic  reasoning  [2].

-  CoIFS  plays  a  role  of  distributed,  concurrent  and  self-organizing  module  for 
“symbiotic”  problem  solving  [3]. 


Fig.  1.  An  example  of  constraint-interval  fuzzy  set 


As shown in the X - A space of Fig. 1, a CoIFS is given as an ordered
collection of "crisp" intervals on the universe of discourse (i.e., the space
X), each of which represents a constraint called an interval constraint. The
grade axis (in the traditional Fuzzy Set Theory) is now regarded as an
"ordinal" scale axis. By introducing this notion of fuzzy set, a continuous
variable can be coded into symbols via fuzzy sets (fuzzy labels), which divide
the domain of the variable (universe of discourse) into intervals. Fuzziness
in each variable is derived by decomposing a "joint constraint" on several
variables into componential (marginal) constraints through "projection" of the
joint constraint onto the componential variables.
Relations between constraint intervals of two fuzzy sets are represented by
"rectangles" in the X - Y space as shown in Fig. 1 [1]. Namely, if we have a
crisp constraint relation C on a pair of variables (say x and y), we can
approximate constraint C by introducing an appropriate constraint interval
fuzzy set on x and another on y.

Suppose that we are planning to go to a tourist resort R for the next
holiday. We have to decide on a sightseeing spot in the outskirts of R to
visit. The constraint is the speed limit on the freeway (60 < s < 100). We have
about  4  hours  for  one  way,  but  it  depends  on  the  time  when  all  of  us  get  up.  In 
this  case,  constraint  C  is  represented  as  the  meshed  area  shown  in  Fig.  1,  and 
the  “time  for  one  way”  and  the  “distance  from  our  city  to  the  sightseeing  spot” 
are  represented  by  fuzzy  sets.  A  portion  of  the  constraint  region  is  approximated 
by  “rectangles”  which  represent  the  relations  between  constraint-intervals  of  two 
fuzzy  sets.  The  vertically  oblong  rectangles  show  that  if  all  of  us  assemble  at  the 
meeting  time  conscientiously,  there  may  be  many  alternatives  for  visiting  spots. 
Otherwise,  the  alternatives  are  limited. 
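
As a rough illustration of the rectangle approximation in this example, the sketch below assumes that the constraint region is {(t, d) : 60 t <= d <= 100 t}, where t is the one-way travel time in hours and d is the distance. The closed bounds, the units and the function name `inscribed_distance_interval` are assumptions made here for illustration only. For a given time interval, it returns the widest distance interval whose rectangle stays inside the region.

```python
def inscribed_distance_interval(t_lo, t_hi, s_min=60.0, s_max=100.0):
    """Widest [d_lo, d_hi] such that the rectangle [t_lo, t_hi] x [d_lo, d_hi]
    lies inside the region {(t, d): s_min * t <= d <= s_max * t}.

    Returns None when no such rectangle exists.
    """
    d_lo = s_min * t_hi   # d >= s_min * t must hold even at the largest t
    d_hi = s_max * t_lo   # d <= s_max * t must hold even at the smallest t
    if d_lo > d_hi:
        return None
    return (d_lo, d_hi)


# A tight time interval leaves a tall (vertically oblong) rectangle ...
punctual = inscribed_distance_interval(4.0, 4.0)
# ... while a loose one narrows the admissible distances.
loose = inscribed_distance_interval(3.5, 4.5)
```

Under these assumptions, the narrower the time interval, the wider the range of distances (and hence of sightseeing spots) that remains admissible.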

CoIFS is a kind of “topological” representation of fuzzy sets. Departing from
the direct representation of fuzzy sets by numeric “membership grade values”
results in the following characteristics:

- The “shape” of a fuzzy set is of no use, and only its “type”, such as
symmetry, is meaningful.

- In the traditional Fuzzy Set Theory, it is implicitly assumed that fuzzy
sets are related “statically” via grade values. For CoIFSs, the same relation
(constraint-level equivalence) can be defined, but it is “dynamically”
changeable according to the contexts of problems.

- Several concepts such as the α-level set [4] and the L-flow set [5] have already
been proposed, in which fuzzy sets are dealt with as a set of intervals. However, CoIFS
involves the following novel characteristics: 1) fuzzy sets are related to each
other through the correspondence among constraint-levels in order to reflect
the topological structure of the universe of discourse; 2) fuzzy sets are regarded
as the organization of the crisp constraint-intervals, which is changeable
according to the order relations.

The $\Lambda - \Lambda'$ space of Fig. 1 shows the constraint-level equivalence relations
between the two fuzzy sets shown in the $X - \Lambda$ and $Y - \Lambda'$ spaces in the figure.

3  Introducing  Chu  Space  to  Fuzzy  Logic 

In this section, we will introduce the notion of Chu Space [6, 7] to Fuzzy Logic
for the formal treatment of constraint propagation between two fuzzy sets.


3.1  A  brief  review  of  Chu  Space 

A Chu space $\mathcal{A} = (I, A, R)$ is a binary relation between two sets $I$ and $A$, where
$R : I \times A \to \Sigma$ gives the binary relation and $\Sigma$ is the set $\{0, 1\}$. A Chu space can
be represented as a binary matrix of dimension $|A| \times |I|$ (Chu map).

Given a Chu space $\mathcal{A} = (I, A, R)$, the functions $\hat{R} : I \to \Sigma^{A}$ and $\check{R} : A \to \Sigma^{I}$
satisfying $\hat{R}(i)(a) = R(i, a) = \check{R}(a)(i)$ are the representations of $I$ and $A$,
respectively ¹. $\hat{R}(i) : A \to \Sigma$ and $\check{R}(a) : I \to \Sigma$ are called a column and a row of $\mathcal{A}$,
respectively. $\hat{R}(i)$ represents $i$ as a function from $A$ to $\Sigma$, and $\check{R}(a)$ represents $a$
as a function from $I$ to $\Sigma$.

Given two Chu spaces $\mathcal{A} = (I, A, R)$ and $\mathcal{B} = (J, B, S)$, a pair of functions
$f : I \to J$ and $g : B \to A$ is called a “Chu transform” from $\mathcal{A}$ to $\mathcal{B}$, provided that
the following adjointness condition holds:

$(\forall i \in I, \forall b \in B) \quad S(f(i), b) = R(i, g(b)).$ (1)

The dual of $\mathcal{A} = (I, A, R)$ is defined as $\mathcal{A}^{\perp} = (A, I, R^{\perp})$, where $R^{\perp} : A \times I \to \Sigma$
with $R^{\perp}(a, i) = R(i, a)$. The dual of a Chu transform $(f, g)$ from $\mathcal{A}$ to $\mathcal{B}$ is a Chu transform $(g, f)$ from
$\mathcal{B}^{\perp}$ to $\mathcal{A}^{\perp}$.

The composition of $\hat{S} : J \to \Sigma^{B}$ and $f : I \to J$, denoted as $\hat{S}f$, represents
$f$, since $\hat{S}(f(i))$ represents the image of $f$ as a function from $B$ to $\Sigma$. Later on,
we will write $\hat{S}f$ as $\phi : I \to \Sigma^{B}$ such that $\phi(i, b) = \hat{S}(f(i))(b) = S(f(i), b)$. The
Chu space $\mathcal{T} = (I, B, \phi)$ represents $f$. On the other hand, the composition of
$\check{R} : A \to \Sigma^{I}$ and $g : B \to A$, denoted as $\phi^{\perp} : B \to \Sigma^{I}$, satisfies the relation
$\phi^{\perp}(b, i) = \check{R}(g(b))(i) = R(i, g(b))$. The Chu space $\mathcal{T}^{\perp} = (B, I, \phi^{\perp})$ represents $g$.

From the fact that the relation $\phi(i, b) = \phi^{\perp}(b, i)$ implies $S(f(i), b) = R(i, g(b))$,
we can interpret the adjointness condition as follows:

Given two Chu spaces $\mathcal{A} = (I, A, R)$ and $\mathcal{B} = (J, B, S)$, $(f, g)$ is a Chu
transform from $\mathcal{A}$ to $\mathcal{B}$, where $\mathcal{T} = (I, B, \phi)$ represents the function
$f : I \to J$ from $\mathcal{A}$ to $\mathcal{B}$, and its dual $\mathcal{T}^{\perp} = (B, I, \phi^{\perp})$ represents the
function $g : B \to A$ from $\mathcal{B}^{\perp}$ to $\mathcal{A}^{\perp}$.
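
The adjointness condition (1) can be checked mechanically on small finite examples. The following is a minimal sketch, assuming that relations are stored as dictionaries from pairs to 0 or 1; the helper names `is_chu_transform` and `dual` and the toy spaces are hypothetical and serve only as illustration.

```python
def is_chu_transform(R, S, f, g):
    """Check S(f(i), b) == R(i, g(b)) for every i in I and b in B.

    R maps pairs (i, a) to 0/1, S maps pairs (j, b) to 0/1,
    f is a dict I -> J and g is a dict B -> A.
    """
    return all(S[(f[i], b)] == R[(i, g[b])] for i in f for b in g)


def dual(R):
    """Dual relation: swap the two arguments of R."""
    return {(a, i): v for (i, a), v in R.items()}


# Hypothetical toy spaces: (I, A, R) and (J, B, S).
R = {("i0", "a0"): 1, ("i0", "a1"): 0,
     ("i1", "a0"): 1, ("i1", "a1"): 1}
S = {("j0", "b0"): 1}

f = {"i0": "j0", "i1": "j0"}      # f : I -> J
g_ok = {"b0": "a0"}               # g : B -> A, satisfies condition (1)
g_bad = {"b0": "a1"}              # violates condition (1) at i0

ok = is_chu_transform(R, S, f, g_ok)
bad = is_chu_transform(R, S, f, g_bad)
# The dual pair (g, f) is a Chu transform between the dual spaces.
ok_dual = is_chu_transform(dual(S), dual(R), g_ok, f)
```

The last line illustrates the statement that the dual of a Chu transform is again a Chu transform, with the roles of the two functions exchanged.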


3.2  Coding  fuzzy  sets  and  their  constraints  by  Chu  Space 

We introduce an interpretation of a fuzzy set as a Chu space which consists of
the set $X$ of the universe of discourse, a constraint-level set $\Lambda$ and their relation
$R : X \times \Lambda \to \Sigma$.

As shown in Fig. 1, a portion of the constraint C on a pair of variables X and
Y is approximated by a set of rectangles, which transmit intervals from one fuzzy
set to another. In the case where two fuzzy sets are represented by Chu spaces
$\mathcal{A} = (X, \Lambda, R)$ and $\mathcal{B} = (Y, \Lambda', S)$, the constraint propagation (transmission)
is represented as $\mathcal{A} \to \mathcal{B}^{\perp}$, which can be seen as a Chu transform $(f, g)$ from
$\mathcal{A} = (X, \Lambda, R)$ to $\mathcal{B}^{\perp} = (\Lambda', Y, S^{\perp})$, where the two functions $f : X \to \Lambda'$ and
$g : Y \to \Lambda$ satisfy the following condition:

$(\forall x \in X, \forall y \in Y) \quad S^{\perp}(f(x), y) = R(x, g(y)).$ (2)

The Chu transform $(f, g) : \mathcal{A} \to \mathcal{B}^{\perp}$ itself also constitutes a Chu space $\mathcal{T} =
(X, Y, \phi)$, which is represented as a Chu map with dimension $|Y| \times |X|$. Namely,
this Chu space represents the portion of the constraint region of the X - Y space which is
approximated by rectangles.

¹ $\Sigma^{A}$ is the collection of all maps from $A$ to $\Sigma$.

Since $\mathcal{T}$ represents $g$ and $\mathcal{T}^{\perp}$ represents $f$, we can interpret the adjointness
condition as follows:

$(\forall x \in X, \forall y \in Y) \quad \phi(x, y) = S^{\perp}(f(x), y),$ (3)

$(\forall x \in X, \forall y \in Y) \quad \phi^{\perp}(y, x) = R(x, g(y)).$ (4)

The map between row $y$ of $\mathcal{T}$ and the corresponding row $g(y)$ of $\mathcal{A}$ is given by
$g$, and that between column $x$ of $\mathcal{T}$ and the corresponding column $f(x)$ of $\mathcal{B}^{\perp}$ is given
by $f$. In other words, $f$ and $g$ provide projections of $\mathcal{T}$ onto $\mathcal{A}$ and $\mathcal{B}^{\perp}$ as shown
in the left part of Fig. 2.
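
To make condition (2) concrete, the sketch below codes two small fuzzy sets as Chu spaces, with R(x, level) = 1 when x lies in the constraint interval of that level. The universes, the intervals per level and the particular maps f and g are hypothetical values chosen for illustration, not data from the paper.

```python
# Hypothetical constraint intervals, one per constraint level.
X = range(5)                       # universe of discourse of the first fuzzy set
Y = range(3)                       # universe of discourse of the second one
x_intervals = {1: {0, 1, 2, 3, 4}, 2: {1, 2, 3}}   # level -> interval on X
y_intervals = {1: {0, 1, 2}, 2: {1}}               # level -> interval on Y


def R(x, lam):
    """R(x, lam) = 1 iff x lies in the constraint interval of level lam."""
    return int(x in x_intervals[lam])


def S_dual(lam_prime, y):
    """Dual relation of the second fuzzy set: 1 iff y lies in the interval of lam'."""
    return int(y in y_intervals[lam_prime])


f = {0: 2, 1: 1, 2: 1, 3: 1, 4: 2}   # f : X -> levels of the second set
g = {0: 2, 1: 1, 2: 2}               # g : Y -> levels of the first set

# Adjointness condition (2): S_dual(f(x), y) == R(x, g(y)) for all x, y.
adjoint = all(S_dual(f[x], y) == R(x, g[y]) for x in X for y in Y)

# Chu map of the space (X, Y, phi), with |Y| rows and |X| columns.
phi = [[S_dual(f[x], y) for x in X] for y in Y]
```

In this toy case the 1-entries of `phi` form a cross-shaped region, i.e. the union of the rectangles {1, 2, 3} x {0, 1, 2} and {0, ..., 4} x {1}, which is one way to picture a constraint region approximated by rectangles.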


Fig.  2.  Chu  transform  between  two  fuzzy  sets  with  adjointness  relation 


The function $f$ associates each point $x \in X$ with an interval represented
by $\lambda' \in \Lambda'$, which equivalently represents constraint propagation. Conversely,
the function $g$ transforms a point $y \in Y$ to $\lambda \in \Lambda$, showing a constraint propagation to
an interval, where constraint propagation is given as yielding an interval derived
by the intersection of the associated intervals for points in the original interval.

Based on the Chu transform $(f, g) : \mathcal{A} \to \mathcal{B}^{\perp}$, we can define a Chu transform
$\mathcal{B}^{\perp} \to \mathcal{A}$, which is a pair of functions $p : \Lambda' \to X$ and $q : \Lambda \to Y$, denoted by
$(p, q)$, such that the following condition holds:

$(\forall a \in \Lambda, \forall b \in \Lambda') \quad R(p(b), a) = S^{\perp}(b, q(a)).$ (5)

The Chu transform $\mathcal{B}^{\perp} \to \mathcal{A}$ is a Chu space denoted as $\mathcal{V} = (\Lambda', \Lambda, \psi)$ as shown
in the upper-right part of Fig. 2.

While the space $\mathcal{T}$ represents the propagation of horizontal intervals in fuzzy
sets, the space $\mathcal{V}$ represents that of vertical intervals, i.e., the corresponding
constraint-levels. The matrix representation of $\mathcal{T}$ does not correspond directly
to the rectangular representation of the constraint combination of fuzzy sets. The
rectangles are generated from the matrix representation of $\mathcal{T}$ by referring to $\mathcal{V}$.

A pair of fuzzy sets that are connected by the Chu transforms $(f, g) : \mathcal{A} \to \mathcal{B}^{\perp}$
and $(p, q) : \mathcal{B}^{\perp} \to \mathcal{A}$ can be interpreted as a Chu transform from $\mathcal{T}$ to $\mathcal{V}$. In
other words, it is a transformation from the space of constraint-intervals to that
of constraint-levels, and vice versa.

We will call $\mathcal{A}$ and $\mathcal{B}^{\perp}$ Fuzzy Spaces, and $\mathcal{T}$ and $\mathcal{V}$ will be
called Interaction Space and Coordination Space, respectively. These four spaces
constitute a unified construction of the interaction and coordination involved in
the setting of problems.

4  Introducing  Channel  Theory  to  Fuzzy  Logic 

In this section, we will introduce the notion of Channel Theory [8], [9] to Fuzzy
Logic for organizing distributed systems which consist of CoIFSs.


4.1  A  brief  review  of  Channel  Theory  on  Information  Flow 

Barwise and Seligman have proposed Channel Theory, which gives a mathematical
framework to Dretske’s qualitative theory of information. Channel Theory
is a qualitative theory of information which treats the content of information
rather than its amount. Based on the notions of classification and infomorphism,
Channel Theory involves the concepts of information channel and local logic.

A “classification” $\mathbf{A} = \langle A, \Sigma_A, \models_A \rangle$ consists of a set $A$ of objects to be
classified, called “tokens” of $\mathbf{A}$, a set $\Sigma_A$ of objects used to classify the tokens,
called “types” of $\mathbf{A}$, and a binary relation $\models_A$ between $A$ and $\Sigma_A$ indicating
the types into which the tokens are classified.

For any classification $\mathbf{A}$, a pair $\langle \Gamma, \Delta \rangle$ of sets of types in $\Sigma_A$ is called a “sequent”
of $\mathbf{A}$. A token $a$ of $\mathbf{A}$ satisfies $\langle \Gamma, \Delta \rangle$ provided that if $a$ is of type $\alpha$ for every
$\alpha \in \Gamma$ then $a$ is of type $\alpha$ for some $\alpha \in \Delta$. If every token $a$ of $\mathbf{A}$ satisfies $\langle \Gamma, \Delta \rangle$,
it will be written as $\Gamma \vdash_{\mathbf{A}} \Delta$.
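
The satisfaction of a sequent by a token can be sketched in a few lines. The classification used here (integers typed by divisibility) and the names `satisfies` and `entails` are hypothetical and chosen only to illustrate the definition.

```python
def satisfies(token, gamma, delta, holds):
    """True if `token` satisfies the sequent (gamma, delta): whenever the token
    is of every type in gamma, it is of at least one type in delta."""
    if all(holds(token, t) for t in gamma):
        return any(holds(token, t) for t in delta)
    return True


def entails(tokens, gamma, delta, holds):
    """True if every token satisfies the sequent (gamma, delta)."""
    return all(satisfies(a, gamma, delta, holds) for a in tokens)


# Hypothetical classification: integers classified by divisibility types.
tokens = [2, 3, 4, 6, 12]
divisor_of_type = {"even": 2, "mult3": 3, "mult6": 6}


def holds(n, t):
    """Classification relation: token n is of type t."""
    return n % divisor_of_type[t] == 0


both_imply_six = entails(tokens, {"even", "mult3"}, {"mult6"}, holds)
even_implies_six = entails(tokens, {"even"}, {"mult6"}, holds)
```

In this toy classification the first sequent holds for every token, while the second fails for the token 2.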

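Given $\mathbf{A} = \langle A, \Sigma_A, \models_A \rangle$ and $\mathbf{C} = \langle C, \Sigma_C, \models_C \rangle$, a pair $\langle f^{\wedge}, f^{\vee} \rangle$ of functions is
called an “infomorphism” from $\mathbf{A}$ to $\mathbf{C}$, provided that the following holds:

$f^{\wedge} : \Sigma_A \to \Sigma_C,$

$f^{\vee} : C \to A,$

$\forall c \in C, \ \forall \alpha \in \Sigma_A,$

$f^{\vee}(c) \models_A \alpha \iff c \models_C f^{\wedge}(\alpha).$

An information channel consists of an indexed family $\mathcal{C} = \{f_i : \mathbf{A}_i \rightleftarrows \mathbf{C}\}_{i \in I}$ of
infomorphisms with a common codomain $\mathbf{C}$ as shown in Fig. 3. The classification
$\mathbf{C}$ is called the “core” of the channel, and a token $c$ of the core is called a “connection” among
the tokens $a_i = f_i^{\vee}(c)$, $i \in I$.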

Fig. 3. An example of an information channel

A local logic $\mathfrak{L} = \langle \mathbf{A}, \vdash_{\mathfrak{L}}, N_{\mathfrak{L}} \rangle$ consists of the following three components:

1. a classification $\mathbf{A} = \langle A, \Sigma_A, \models_A \rangle$,

2. a set $\vdash_{\mathfrak{L}}$ of sequents of $\mathbf{A}$,

3. a subset $N_{\mathfrak{L}} \subseteq A$, called the normal tokens of $\mathfrak{L}$, each element (token) of
which satisfies all the sequents of $\vdash_{\mathfrak{L}}$.

According to the definition of normal tokens, there exist several local logics in a
classification in general. In this sense, a local logic has “locality” in the core $\mathbf{C}$
of the channel.

4.2  Coding  fuzzy  sets  and  their  constraints  by  Channel  Theory 

By regarding a collection $I_X$ of intervals over the universe of discourse $X$ and
a constraint-level set $\Lambda$ as tokens and types, respectively, we interpret a fuzzy
set as a classification $\mathbf{A} = \langle I_X, \Lambda, \models_A \rangle$. A token which is associated with a type
represents a “constraint interval”, and the collection of these tokens constitutes
the fuzzy set.


Fig.  4.  A  channel  theoretic  model  of  constraint  propagation 


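Given two fuzzy sets $\mathbf{A} = \langle I_X, \Lambda, \models_A \rangle$ and $\mathbf{B} = \langle I_Y, \Lambda', \models_B \rangle$ as
classifications, the “constraint region” generated by these fuzzy sets is represented as
the following classification $\mathbf{C}$:

1. The tokens of $\mathbf{C}$ are the Cartesian product of $I_X$ and $I_Y$. More precisely, the
tokens of $\mathbf{C}$ are pairs $\langle i_x, i_y \rangle$ of tokens ($i_x \in I_X$, $i_y \in I_Y$) and represent the
“rectangular region” which is the product of interval $i_x$ over $X$ and interval
$i_y$ over $Y$.

2. The types of $\mathbf{C}$ are the disjoint union of $\Lambda$ and $\Lambda'$. For simplicity, the types of
$\mathbf{C}$ are represented as pairs $\langle i, \alpha \rangle$, where $i = 0$ and $\alpha \in \Lambda$, or $i = 1$ and $\alpha \in \Lambda'$.

3. The classification relation $\models_C$ is defined by

$\langle i_x, i_y \rangle \models_C \langle 0, \lambda \rangle$ iff $i_x \models_A \lambda$,

$\langle i_x, i_y \rangle \models_C \langle 1, \lambda' \rangle$ iff $i_y \models_B \lambda'$.

The infomorphisms $\sigma_A = \langle \sigma_A^{\wedge}, \sigma_A^{\vee} \rangle : \mathbf{A} \rightleftarrows \mathbf{C}$ and $\sigma_B = \langle \sigma_B^{\wedge}, \sigma_B^{\vee} \rangle : \mathbf{B} \rightleftarrows \mathbf{C}$ are defined as
follows:

1. $\sigma_A^{\wedge}(\lambda) = \langle 0, \lambda \rangle$ for $\forall \lambda \in \Lambda$,

2. $\sigma_B^{\wedge}(\lambda') = \langle 1, \lambda' \rangle$ for $\forall \lambda' \in \Lambda'$, and

3. for each pair $\langle i_x, i_y \rangle$,
$\sigma_A^{\vee}(\langle i_x, i_y \rangle) = i_x$ and $\sigma_B^{\vee}(\langle i_x, i_y \rangle) = i_y$.

We can describe “regularity” and “order formation” among conclusion
relations over several spaces through infomorphisms. More precisely, when a certain
relation is given as a conclusion of other relations, each infomorphism maps the
types representing these relations onto the core, and the relations among the mapped
images (types) are formulated as the statement that a token satisfies a “sequent”,
that is, a kind of proposition on $\mathbf{C}$. These relations are represented by the sequent
$\langle \Gamma, \Delta \rangle$ which consists of a pair of $\Gamma = \{\sigma_A^{\wedge}(\lambda)\}$ and $\Delta = \{\sigma_B^{\wedge}(\lambda')\}$, where $\lambda \in \Lambda$
and $\lambda' \in \Lambda'$. The sequents $\langle \Gamma, \Delta \rangle$ to which every token in $\mathbf{C}$ should be subject
can be considered as rectangular regions generated by constraint intervals of two
fuzzy sets. Furthermore, the whole order formation given by these propositions
is formalized as a “local logic”.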
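
The construction of the core classification and of one of its infomorphisms can be sketched as follows. The intervals, the level names and the rule that each interval is of exactly one level are hypothetical simplifications made for this illustration.

```python
from itertools import product

# Hypothetical fuzzy sets as classifications: tokens are intervals
# (written as (lo, hi) pairs), types are constraint levels.
I_X = [(0, 4), (1, 3)]
I_Y = [(0, 2), (1, 1)]
models_A = {((0, 4), "l1"), ((1, 3), "l2")}   # pairs (i_x, level) with i_x |= level
models_B = {((0, 2), "m1"), ((1, 1), "m2")}   # pairs (i_y, level) with i_y |= level
LEVELS_A = ["l1", "l2"]
LEVELS_B = ["m1", "m2"]

# Core classification: tokens are rectangles, types are tagged levels.
tokens_C = list(product(I_X, I_Y))
types_C = [(0, lam) for lam in LEVELS_A] + [(1, lam) for lam in LEVELS_B]


def models_C(token, typ):
    """Classification relation of the core, defined componentwise."""
    (i_x, i_y), (tag, level) = token, typ
    if tag == 0:
        return (i_x, level) in models_A
    return (i_y, level) in models_B


def sigma_A_up(lam):
    """Type part of the infomorphism: a level of the first set goes into the core."""
    return (0, lam)


def sigma_A_down(token):
    """Token part of the infomorphism: a rectangle is sent to its X-interval."""
    return token[0]


# Infomorphism condition: sigma_A_down(c) |= lam  iff  c |= sigma_A_up(lam).
is_infomorphism = all(
    ((sigma_A_down(c), lam) in models_A) == models_C(c, sigma_A_up(lam))
    for c in tokens_C for lam in LEVELS_A
)

# Rectangles that satisfy the sequent with Gamma = {(0, "l2")}, Delta = {(1, "m2")}.
satisfying = [c for c in tokens_C
              if not models_C(c, (0, "l2")) or models_C(c, (1, "m2"))]
```

In this sketch a candidate set of normal tokens for that sequent would be `satisfying`, which excludes the single rectangle pairing level "l2" with level "m1".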

5  New  Perspective  on  Fuzzy  Logic  by  Introducing 
Theories  of  Chu  Space  and  Information  Flow 

We have introduced Chu Space and Channel Theory to Fuzzy Logic in the previous
two sections. In this section, we will discuss the place where the structural
coupling between subjects and environments takes place, which we call the interfacing
media. We think this structural coupling in the interfacing media can be
elucidated from two points of view, i.e., from Chu Space and Channel Theory.

We encode the action space of the subject and the sensation space from the
environment into constraint-interval fuzzy sets, and also introduce spaces for
representing the background structures of the situation-action relation, where the
correspondence of constraint-intervals with constraint-levels between two fuzzy
sets is treated.

By introducing Chu Space Theory to Fuzzy Logic, we can clarify that CoIFS
can naturally represent the Coordination Space, which stands for the coordination
relation between constraint-intervals in two fuzzy sets. A coupling of
sensation and action is represented as the adjointness relation among constraint
intervals in the Interaction Space and the adjointness relation among constraint
levels in the Coordination Space. These spaces prescribe each other, which in
turn leads to the “stabilization of coupling” as shown in Fig. 5.

While each of these spaces represents an adjointness relation as a Chu transform,
they have an adjointness relation of sensation and action at the meta level when these
spaces correspond with each other. Given a certain fluctuation on the
side of the subject or that of the environment, the adjointness relation at the meta
level is newly formed. Such a stabilization of coupling makes the interaction with
the environment smooth and forms a certain order.


Fig. 5. Stabilization of the structural coupling (labels: Sensation Space, Interaction Space, Coordination Space, Action Space)


On the other hand, Channel Theory treats information flow qualitatively,
i.e., it puts emphasis on the “content” of information rather than its amount.
Constraint propagation among fuzzy sets propagates the constraint information
(intervals) while preserving the relational structure among constraint levels.
In this sense, constraint propagation can be seen as information flow through a
channel as shown in Fig. 6. CoIFSs with constraint relations among them can
be interpreted as a distributed and decentralized system on the medium of the
channel.


Fig. 6. Formation of the structural coupling

Roughly speaking, an infomorphism which constitutes a channel is equivalent
to a Chu transform which connects Chu spaces. Chu Space and Channel Theory
respectively can be regarded as a local view and a global view on the distributed
and decentralized system.

6  Conclusion 

Introducing the theories of Chu Space and Information Flow has been shown to
provide us with a new perspective on Fuzzy Logic. In particular, the notion of CoIFS
prescribes the coupling structure as a constraint region between two variables such
as sensation and action, and Chu Space can be used to clarify the new relational
structure in the coupling, which we have called Coordination Space. Channel
Theory gives the form of constraint propagation as a flow of information and also
provides us with a novel view that elucidates CoIFSs connected by constraint
relations as a distributed and decentralized system.

Several issues remain as future work. First, we have to examine theoretically
whether our treatment of sequents is suitable for Fuzzy Logic. Second, since the
fundamental relationship of Channel Theory with Fuzzy Logic is still not so clear,
we have to examine its relationship with Shannon’s Information Theory in more
detail.

Abstract. This paper presents pattern reasoning, that is, the logical
reasoning of patterns. Pattern reasoning is a new solution for the knowledge
acquisition problem. Knowledge acquisition has tried to acquire linguistic
rules from patterns. In contrast, we try to modify logics to reason patterns.
Patterns are represented as functions, which are approximated by
neural networks. Therefore, pattern reasoning is realized by logical reasoning
of neural networks. A few non-classical logics can reason neural
networks, because neural networks can basically be regarded as multilinear
functions and these logics are complete for the multilinear function space.
This paper explains intermediate logic LC as an example of such logics and
demonstrates how neural networks can be reasoned by LC.


1  Introduction 

This paper presents pattern reasoning, that is, the logical reasoning of patterns.
An example of pattern reasoning is as follows. Expert doctors diagnose
using many images such as brain images, electrocardiograms and so on,
which can be formalized as follows:

Rule 1: If a brain image is a pattern, then an electrocardiogram is a pattern.
Rule 2: If an electrocardiogram is a pattern, then an electromyogram is a pattern.

Using  the  above  two  rules,  we  can  reason  as  follows: 

If  a  brain  image  is  a  pattern,  then  an  electromyogram  is  a  pattern. 

This  is  a  pattern  reasoning.  Symbols  can  be  regarded  as  special  cases  of  patterns. 
For  example,  let  a  rule  be 

If  a  brain  image  is  a  pattern,  then  a  subject  has  a  disease. 

The  right  side  of  the  rule  is  a  symbol.  The  rule  can  be  regarded  as  a  special  case 
of  pattern  reasoning. 

Pattern reasoning is a new solution for the knowledge acquisition problem.
The explanation is as follows. Since it is important to simulate human experts by
computer software, expert systems have been studied for this purpose.
Many expert systems are based on classical logic or something
like classical logic (hereinafter, “classical logic” is used for simplicity).

Knowledge acquisition is necessary, because the obscure knowledge of human
experts cannot be reasoned by classical logic, while linguistic rules can be
reasoned by classical logic. Knowledge acquisition means the conversion from
obscure knowledge of human experts to linguistic rules. Knowledge acquisition
has been studied by many researchers, but the results have not been successful,
that is, the results show that knowledge acquisition is very difficult.

Generally speaking, a processing consists of a method and an object. For example,
logical reasoning consists of the reasoning as the method and the symbols
as the object. The methods of the processings by human experts are a kind of
reasoning, which is different from the reasoning by classical logic. The objects
of the processings by human experts are patterns (and symbols). That is, many
of the processings by human experts can be regarded as pattern reasonings.

Therefore, pattern reasoning is a solution for the knowledge acquisition
problem based on the conversion of knowledge acquisition to a completely different
problem. However, readers may think that it is impossible to reason patterns.
This paper shows that patterns can be reasoned by non-classical logics.

There are several possible definitions for patterns. Patterns such as images
can basically be represented as functions. For example, two-dimensional images
can be represented as functions of two variables. In this paper, patterns are functions. Since
it is desirable to be able to deal with any function, 3-layer feedforward neural
networks, which can basically approximate any function [5], are studied.

Therefore, pattern reasonings are realized as logical reasonings of neural
networks. However, classical logic cannot reason neural networks, while a few
non-classical logics can reason neural networks. For example, intermediate logic
LC [1], [4], product logic [4], and Łukasiewicz logic [4] can reason neural networks.
The reason why the above three logics can reason neural networks is as follows:
neural networks are multilinear functions in the discrete domain and are well
approximated by multilinear functions in the continuous domain, and the three
logics are complete for the multilinear function space.

The key is the multilinear function space. In the domain {0, 1}, the multilinear
function space is an extension of the Boolean algebra of Boolean functions. The
space is the linear space spanned by the atoms of the Boolean algebra of Boolean
functions and can be made into a Euclidean space. Logical operations are represented
as vector operations, which are numerical computations. In the domain
[0,1], continuous Boolean functions can be obtained. Roughly speaking, continuous
Boolean functions consist of conjunction, disjunction, direct proportion and
inverse proportion. The multilinear function space of the domain [0,1] is the linear
space spanned by the atoms of the Boolean algebra of continuous Boolean functions and
can be made into a Euclidean space.

As explained above, the multilinear function space is a model of three logics,
but due to space limitations, only intermediate logic LC (hereinafter, LC for short) is
explained in this paper. Intermediate logics are logics which are stronger than
intuitionistic logic and weaker than classical logic. The multilinear function space is
an algebraic model of intuitionistic logic, but intuitionistic logic is not complete
for the space. For intuitionistic logic, refer to [1]. LC, which is stronger than
intuitionistic logic and weaker than classical logic, is complete for the space.
Therefore, multilinear functions can be regarded as propositions of LC. Neural
networks, which can basically be regarded as multilinear functions, can also be
regarded as propositions of LC. Therefore, neural networks can be logically
reasoned.

Section 2 explains the multilinear function space, which is the theoretical
foundation for the logical reasoning of neural networks. Section 3 explains LC. Section
4 describes how neural networks can be reasoned by LC. Section 5 states several
remarks on pattern reasoning.

2  Multilinear  function  space 

First, multilinear functions are explained. The domains are divided into
discrete domains and continuous domains. The discrete domain is reduced to {0, 1}
and the continuous domain is normalized to [0,1]. Therefore, the {0, 1} domain and
the [0,1] domain are discussed. Second, it is shown that the multilinear function space
of the domain {0, 1} is a Euclidean space spanned by the atoms of the Boolean
algebra of Boolean functions. Third, it is explained that the space of the domain
[0,1] can be made into a Euclidean space. Fourth, the vector representations
are explained. Finally, the relationship between neural networks and multilinear
functions is explained.


2.1  Multilinear  functions 

Definition 1 Multilinear functions of n variables are as follows:

$\sum_{i=1}^{2^n} a_i x_1^{e_1} \cdots x_n^{e_n},$

where $a_i$ is real, $x_i$ is a variable, and $e_i$ is 0 or 1.


In this paper, n stands for the number of variables.

Example Multilinear functions of 2 variables are as follows:

$axy + bx + cy + d.$

Multilinear functions do not contain any terms such as

$x_1^{k_1} x_2^{k_2} \cdots x_n^{k_n},$

where $k_i \geq 2$. A function $f : \{0, 1\}^n \to \mathbf{R}$ is a multilinear function, because
$x_i^{k_i} = x_i$ holds in {0,1} and so there is no term like $x_1^{k_1} x_2^{k_2} \cdots x_n^{k_n}$ ($k_i \geq 2$)
in the functions. In other words, multilinear functions are functions which are
linear when only one variable is considered and the other variables are regarded
as parameters.
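
The claim that every function on {0, 1}^n is multilinear can be illustrated by computing the coefficients explicitly. The sketch below uses inclusion-exclusion over the points of the domain; this particular computation and the helper names `multilinear_coefficients` and `evaluate` are choices made here for illustration and are not taken from the paper.

```python
from itertools import product


def multilinear_coefficients(f, n):
    """Coefficients a_e of sum_e a_e * x1^e1 * ... * xn^en (each e_i in {0, 1})
    that reproduce f on {0, 1}^n, obtained by inclusion-exclusion."""
    coeffs = {}
    for e in product((0, 1), repeat=n):
        total = 0
        # Sum over all points s that are componentwise <= e.
        for s in product(*[(0, 1) if ei else (0,) for ei in e]):
            sign = (-1) ** (sum(e) - sum(s))
            total += sign * f(*s)
        coeffs[e] = total
    return coeffs


def evaluate(coeffs, point):
    """Evaluate the multilinear polynomial at a point."""
    value = 0
    for e, a in coeffs.items():
        term = a
        for ei, xi in zip(e, point):
            if ei:
                term *= xi
        value += term
    return value


def xor(x, y):
    return x ^ y


# Exclusive-or is not linear, yet it is multilinear: x + y - 2xy.
xor_coeffs = multilinear_coefficients(xor, 2)
```

Here the key (1, 1) of `xor_coeffs` holds the coefficient of xy, the keys (1, 0) and (0, 1) those of x and y, and (0, 0) the constant term.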


2.2  Multilinear  function  space  of  the  domain  {0, 1} 

Definition 2 The atoms of Boolean algebra of Boolean functions of n variables
are as follows:

$\phi_i = \prod_{j=1}^{n} e(x_j) \quad (i = 1, \ldots, 2^n),$

where $e(x_j) = \overline{x_j}$ or $x_j$.

Example The atoms of Boolean algebra of Boolean functions of 2 variables are
as follows: $x \wedge y$, $x \wedge \overline{y}$, $\overline{x} \wedge y$, $\overline{x} \wedge \overline{y}$.

Theorem 1 The space of multilinear functions ($\{0, 1\}^n \to \mathbf{R}$) is the linear space
spanned by the atoms of Boolean algebra of Boolean functions. The proof can be
found in [11], [12].

Definition 3 The inner product is defined as follows:

$\langle f, g \rangle = \sum_{\{0,1\}^n} f g.$

The sum in the above formula is taken over the whole domain.

Definition 4 The norm is defined as follows:

$|f| = \sqrt{\langle f, f \rangle} = \sqrt{\sum_{\{0,1\}^n} f^2}.$

Theorem 2 The multilinear function space is a Euclidean space with the above
norm. The proof can be found in [11].
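
A small numerical check may help to see why the atoms serve as a basis. The sketch below, written for n = 2 with hypothetical helper names, evaluates the inner product of Definition 3 and shows that the atoms are orthonormal, so the coordinates of a function with respect to the atoms are simply its values at the points of {0, 1}^n.

```python
from itertools import product
from math import sqrt


def atom(signs):
    """Atom prod_j e(x_j): signs[j] = 1 picks x_j, signs[j] = 0 picks 1 - x_j."""
    def phi(*x):
        value = 1
        for s, xj in zip(signs, x):
            value *= xj if s else 1 - xj
        return value
    return phi


def inner(f, g, n):
    """<f, g> = sum of f * g over the whole domain {0, 1}^n."""
    return sum(f(*x) * g(*x) for x in product((0, 1), repeat=n))


def norm(f, n):
    return sqrt(inner(f, f, n))


n = 2
atoms = {signs: atom(signs) for signs in product((0, 1), repeat=n)}

# Gram matrix of the atoms: 1 on the diagonal, 0 elsewhere.
gram = {(s, t): inner(atoms[s], atoms[t], n) for s in atoms for t in atoms}


def f(x, y):
    """An arbitrary multilinear function used as an example."""
    return 3 * x * y - x + 2


# Coordinate of f along each atom equals the value of f at the matching point.
coords = {s: inner(f, atoms[s], n) for s in atoms}
```

On the domain {0, 1} the complement is written as 1 - x, which agrees with the negation of a Boolean variable.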


2.3  Multilinear  functions  of  the  domain  [0,1] 

Multilinear  functions  of  the  domain  [0,1]  are  briefly  described  in  this  subsection. 
For  details,  see  [8]. 

Definition 5 Definition of $\tau$

Let $f(x)$ be a real polynomial function. Consider the following formula:

$f(x) = p(x)(x - x^2) + q(x),$

where $q(x) = ax + b$, where $a$ and $b$ are real, that is, $q(x)$ is the remainder. $\tau_x$ is
defined as follows:

$\tau_x : f(x) \to q(x).$

The above definition implies the following property:

$\tau_x(x^k) = x.$

In the case of n variables, $\tau$ is defined as follows:

$\tau = \prod_{i=1}^{n} \tau_{x_i}.$

Example $\tau(x^2 y^3 + y + 1) = xy + y + 1.$
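
The operator $\tau$ can be sketched on polynomials stored as dictionaries from exponent tuples to coefficients. This representation and the function name `tau` are assumptions made for illustration; the reduction itself follows the property that every power of a variable with exponent at least 1 is replaced by the variable itself.

```python
def tau(poly):
    """Apply tau to a polynomial given as {exponent tuple: coefficient}.

    Every exponent k >= 1 is reduced to 1 (x^k -> x); exponent 0 is kept.
    Terms that cancel after the reduction are dropped.
    """
    reduced = {}
    for exps, coeff in poly.items():
        key = tuple(min(k, 1) for k in exps)
        reduced[key] = reduced.get(key, 0) + coeff
    return {e: c for e, c in reduced.items() if c != 0}


# x^2 * y^3 + y + 1 with variables ordered (x, y).
p = {(2, 3): 1, (0, 1): 1, (0, 0): 1}
result = tau(p)    # represents x*y + y + 1

# x^2 - x reduces to the zero polynomial, since x^2 is identified with x.
vanishing = tau({(2,): 1, (1,): -1})
```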

Theorem 3 The space of multilinear functions ($[0, 1]^n \to \mathbf{R}$) is a Euclidean
space with the following inner product:

$\langle f, g \rangle = 2^n \int_0^1 \tau(fg) \, dx.$

The proof can be found in [8].

Definition 6 Logical operations are defined as follows:

AND: $\tau(fg)$, OR: $\tau(f + g - fg)$, NOT: $\tau(1 - f)$.

Theorem 4 The functions obtained from Boolean functions by extending the
domain from {0,1} to [0,1] can satisfy all axioms of classical logic with the
logical operations defined above. The proof can be found in [9].

Therefore, in this paper, the functions obtained from Boolean functions by
extending the domain from {0, 1} to [0,1] are called continuous Boolean functions.
Example $x$, $1 - x$ ($= \overline{x}$), and $xy$ are continuous Boolean functions, where $x, y \in
[0, 1]$. $x$ means a direct proportion and $\overline{x}$ means an inverse proportion.
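
The effect of $\tau$ in the logical operations of Definition 6 can be seen on a single variable. The sketch below is a minimal illustration with hypothetical helper names, again storing polynomials as dictionaries from exponent tuples to coefficients; it shows that idempotence, the excluded middle and non-contradiction hold for the polynomial x once $\tau$ is applied.

```python
def tau(poly):
    """Reduce every exponent k >= 1 to 1 and drop cancelled terms."""
    reduced = {}
    for exps, coeff in poly.items():
        key = tuple(min(k, 1) for k in exps)
        reduced[key] = reduced.get(key, 0) + coeff
    return {e: c for e, c in reduced.items() if c != 0}


def add(p, q, scale=1):
    """Return p + scale * q."""
    out = dict(p)
    for e, c in q.items():
        out[e] = out.get(e, 0) + scale * c
    return out


def mul(p, q):
    """Return the product of two polynomials."""
    out = {}
    for e1, c1 in p.items():
        for e2, c2 in q.items():
            e = tuple(a + b for a, b in zip(e1, e2))
            out[e] = out.get(e, 0) + c1 * c2
    return out


ONE = {(0,): 1}    # the constant 1 as a polynomial in one variable
X = {(1,): 1}      # the polynomial x


def l_and(f, g):
    return tau(mul(f, g))


def l_or(f, g):
    return tau(add(add(f, g), mul(f, g), scale=-1))


def l_not(f):
    return tau(add(ONE, f, scale=-1))


idempotent = l_and(X, X)                 # tau(x^2) = x
excluded_middle = l_or(X, l_not(X))      # tau(1 - x + x^2) = 1
contradiction = l_and(X, l_not(X))       # tau(x - x^2) = 0
```

Without $\tau$, the product x · x would be x², which differs from x on the open interval (0, 1); the reduction is what restores the Boolean identities.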


2.4  Vector  representations  of  logical  operations 

Multilinear functions are divided into Boolean functions and the others. The
others can also be regarded as logical functions, which will be explained later. The
vector representations of logical functions are called logical vectors. $\mathbf{f}\,((f_i))$, $\mathbf{g}\,((g_i))$, ...
stand for logical vectors. Note that $f$ stands for a function, while $f_i$ stands for
a component of a logical vector $\mathbf{f}$.

Vector representations of logical operations are as follows:

$\mathbf{f} \wedge \mathbf{g} = (\mathrm{Min}(f_i, g_i)), \quad \mathbf{f} \vee \mathbf{g} = (\mathrm{Max}(f_i, g_i)), \quad \overline{\mathbf{f}} = (1 - f_i).$

When multilinear functions are Boolean functions, the above vector representations
of logical operations are the same as the representations below:

$\mathbf{f} \wedge \mathbf{g} = (f_i g_i), \quad \mathbf{f} \vee \mathbf{g} = (f_i + g_i - f_i g_i), \quad \overline{\mathbf{f}} = (1 - f_i).$
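
The agreement of the two representations on Boolean vectors, and their disagreement elsewhere, can be checked directly. The sketch below treats logical vectors as plain Python lists; the function names and sample vectors are hypothetical.

```python
def v_and(f, g):
    return [min(fi, gi) for fi, gi in zip(f, g)]


def v_or(f, g):
    return [max(fi, gi) for fi, gi in zip(f, g)]


def v_not(f):
    return [1 - fi for fi in f]


def v_and_product(f, g):
    return [fi * gi for fi, gi in zip(f, g)]


def v_or_product(f, g):
    return [fi + gi - fi * gi for fi, gi in zip(f, g)]


# Boolean logical vectors: every component is 0 or 1.
f = [1, 1, 0, 0]
g = [1, 0, 1, 0]
same_on_boolean = (v_and(f, g) == v_and_product(f, g)
                   and v_or(f, g) == v_or_product(f, g))

# A non-Boolean vector: Min gives 0.5, the product gives 0.25.
h = [0.5, 0.5]
differs_off_boolean = v_and(h, h) != v_and_product(h, h)
```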


2.5  The  relationship  between  neural  networks  and  multilinear 
functions 

Theorem 5 When the domain is {0,1}, neural networks are multilinear
functions.

Proof As described in 2.1, a function whose domain is {0, 1} is a multilinear
function. Therefore, when the domain is {0, 1}, neural networks are multilinear
functions.

When the domain is [0,1], neural networks are approximately multilinear
functions with the following:

$x^k = x \ (k \leq a), \quad x^k = 0 \ (k > a),$

where $a$ is a natural number. When $a = 1$, the above approximation is the linear
approximation.

3  Intermediate  logic  LC  and  multilinear  function  space 

This section briefly explains the intermediate logic LC and the multilinear function
space. Intermediate logics are weaker than classical logic and stronger than
intuitionistic logic. An explanation of intermediate logics can be found in [1]. LC is
an intermediate logic, which was presented by Dummett [2]. The logic is defined
as follows [1].

LC = intuitionistic logic + $(x \to y) \vee (y \to x)$,

where x and y are logical formulas. LC stands for Logic of Chain, which comes
from the fact that the model of the logic is a chain, that is, a linearly ordered
set.

First, it is explained that LC is complete for the interval [0,1]. The proof
of the completeness of LC for the model cannot be described due to space
limitations, and so an intuitive explanation is given. Second, it is explained that
the multilinear function space is an algebraic model of LC. Finally, an example
of logical reasoning of multilinear functions by LC is given.


3.1  An  intuitive  explanation  for  LC 

The interval [0,1] is a Heyting algebra, which is the algebraic model of
intuitionistic logic, with the following definitions:

$x \wedge y = \mathrm{Min}(x, y), \quad x \vee y = \mathrm{Max}(x, y),$

$x \supset y = \begin{cases} 1 & (x \le y) \\ y & (x > y), \end{cases}$

$\top = 1, \quad \bot = 0,$ where x and y stand for points.

The above fact can be easily verified. Let x and y stand for two points, then

$(x \le y) \vee (y \le x)$

holds. Roughly speaking, by replacing $\le$ in the above formula by $\to$, the
following formula is obtained.

$(x \to y) \vee (y \to x),$

where x and y are propositions. The above formula does not hold in intuitionistic
logic. In other words, intuitionistic logic is not complete for an interval.

If the above formula is added to intuitionistic logic, a logic which is complete
for an interval is obtained. The logic is LC.

$(x \to y) \vee (y \to x)$

holds in LC, therefore, LC is complete for an interval.

The  completeness  of  LC  for  an  interval  can  be  proved  using  the  fact  that 
LC  is  complete  for  linearly  ordered  Kripke  models  [1]  and  the  correspondence 
between  Kripke  models  and  algebraic  models  [3]. 
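The claim that [0,1] with Min, Max and the above implication is a Heyting algebra satisfying the LC axiom can be checked numerically. The following is a minimal sketch, not a proof: it only tests a finite grid of rational points, and the names `meet`, `join`, `implies` and `grid` are chosen here for illustration.

```python
from fractions import Fraction
from itertools import product


def meet(x, y):
    return min(x, y)


def join(x, y):
    return max(x, y)


def implies(x, y):
    """Relative pseudo-complement on [0,1]: 1 if x <= y, otherwise y."""
    return Fraction(1) if x <= y else y


# Finite sample of the interval [0,1]; exact rationals avoid rounding issues.
grid = [Fraction(k, 10) for k in range(11)]

# Heyting condition: (x meet z) <= y  if and only if  z <= (x implies y).
residuated = all(
    (meet(x, z) <= y) == (z <= implies(x, y))
    for x, y, z in product(grid, repeat=3)
)

# LC axiom: (x -> y) v (y -> x) always takes the top value 1.
linear = all(
    join(implies(x, y), implies(y, x)) == 1
    for x, y in product(grid, repeat=2)
)
```

Both flags evaluate to true on the grid, which agrees with the intuitive explanation above: any two points of a chain are comparable, so one of the two implications is always 1.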


3.2  The  multilinear  function  space  is  an  algebraic  model  of  LC 

It  is  explained  that  the  multilinear  function  space  is  an  algebraic  model  of  LC 
as  follows  [10]. 

1. If an interval is a model of a logic, the direct sum of the intervals is also a
model of the logic [6]. The logical operations are done componentwise.
Therefore, since an interval [0,1] is an algebraic model of LC, a direct sum of
intervals $[0,1]^m$ (m is the dimension) is also an algebraic model of LC.

2. The multilinear function space is a linear space, therefore, a subset
$[0,1]^m$ of the space is a direct sum of the interval [0,1].

3. From items 1 and 2, the subset $[0,1]^m$ of the multilinear function space is
an algebraic model of LC.

Theorem 6 LC is complete for the hypercube $[0,1]^m$ of the space. The
definitions are as follows:

$f \le g \equiv \forall i\,(f_i \le g_i), \quad f \wedge g = (\mathrm{Min}(f_i, g_i)), \quad f \vee g = (\mathrm{Max}(f_i, g_i)),$

$f \supset g = (f_i \supset g_i),$

$\bar{f} = (f_i \supset 0),$

where f and g stand for logical vectors.

This theorem is understood from the above discussions. The proof is omitted.


Example

$f = 0.6xy + 0.1x + 0.1y + 0.1$
is transformed to
$0.9xy + 0.2x\bar{y} + 0.2\bar{x}y + 0.1\bar{x}\bar{y},$
therefore, f = (0.9, 0.2, 0.2, 0.1).

In the same way,

$\bar{f} = (f_i \supset 0) = (0.9 \supset 0, 0.2 \supset 0, 0.2 \supset 0, 0.1 \supset 0) = (0, 0, 0, 0).$

Therefore, from Theorem 6,

$f \vee \bar{f} = (0.9, 0.2, 0.2, 0.1),$
which means $f \vee \bar{f} \ne 1$.

This example shows that the law of excluded middle $f \vee \bar{f} = 1$,
which holds in classical logic, does not hold in LC. If f is limited to Boolean
functions, the law of excluded middle holds. For example, let f be xy, that is,

$f = 1.0xy + 0.0x + 0.0y + 0.0 = 1.0xy + 0.0x\bar{y} + 0.0\bar{x}y + 0.0\bar{x}\bar{y},$

then f = (1, 0, 0, 0),

$\bar{f} = (f_i \supset 0) = (1 \supset 0, 0 \supset 0, 0 \supset 0, 0 \supset 0) = (0, 1, 1, 1).$

Therefore, $f \vee \bar{f} = (1, 1, 1, 1)$, that is, $f \vee \bar{f} = 1$.
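The example can be reproduced with a short sketch. It assumes that the logical vector lists the function values in the order (x, y) = (1,1), (1,0), (0,1), (0,0), which matches the vectors given in the example. The helper names are illustrative only.

```python
from fractions import Fraction as F


def logical_vector(f):
    """Values of f at (x, y) = (1,1), (1,0), (0,1), (0,0)."""
    return tuple(f(x, y) for x, y in ((1, 1), (1, 0), (0, 1), (0, 0)))


def imp(a, b):
    """Componentwise implication of Theorem 6: 1 if a <= b, otherwise b."""
    return F(1) if a <= b else b


def neg(v):
    """Negation of a logical vector: (f_i implies 0)."""
    return tuple(imp(a, F(0)) for a in v)


def join(u, v):
    """Disjunction of two logical vectors: componentwise Max."""
    return tuple(max(a, b) for a, b in zip(u, v))


# f = 0.6xy + 0.1x + 0.1y + 0.1 and the Boolean function g = xy.
f = lambda x, y: F(6, 10) * x * y + F(1, 10) * x + F(1, 10) * y + F(1, 10)
g = lambda x, y: F(x * y)

vf = logical_vector(f)   # (9/10, 1/5, 1/5, 1/10)
vg = logical_vector(g)   # (1, 0, 0, 0)

excluded_middle_f = join(vf, neg(vf))
excluded_middle_g = join(vg, neg(vg))
```

Here `excluded_middle_f` stays equal to `vf`, while `excluded_middle_g` is the all-ones vector, as in the two cases of the example.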


4  Logical  reasoning  of  neural  networks  by  LC 

Neural networks are multilinear functions and the multilinear function space is
an algebraic model of LC. Therefore, neural networks can be reasoned by LC.
The domain is $\{0,1\}^n$, where n is the number of variables.

Let $N_1$ and $N_2$ be two trained neural networks, which have 3 layers, two
inputs x and y, two hidden units, and one output. The output function of each
unit is a sigmoid function. The following tables show the training results of
weight parameters and biases of $N_1$ and of $N_2$.


Training results of $N_1$:

unit       w1 (w3, w5)   w2 (w4, w6)   bias
hidden 1   -4.87         -4.86         -6.70
hidden 2   -2.86         -2.88          3.50
output      7.61         -3.83          4.50

Training results of $N_2$:

unit       w1 (w3, w5)   w2 (w4, w6)   bias
hidden 1    4.80          4.72         -2.31
hidden 2   -3.49         -3.56          1.67
output      5.81         -4.62

$N_1$ is as follows:

$S(7.61\,S(-4.87x - 4.86y - 6.70) - 3.83\,S(-2.86x - 2.88y + 3.50) + 4.50),$

From the above formula, the logical vector is calculated as follows:
(0.98,0.01,0.01,0.00). 

The  logical  vector  of  N2  is  calculated  in  the  same  way  as  follows: 
(0.02,0.98,0.98,0.99). 

The logical conjunction of the two logical vectors is as follows:
(0.02,0.01,0.01,0.00),
which is nearly equal to 0.

The multilinear function is as follows:

$0.02xy + 0.01x(1 - y) + 0.01(1 - x)y + 0.00(1 - x)(1 - y) = 0.01x + 0.01y.$

The function is nearly equal to 0. The above result shows that the logical
conjunction of two trained neural networks is almost false, which cannot be seen
from the training results of neural networks. $N_1$ has been trained using
$x \wedge y$ and $N_2$ has been trained using the negation of $x \wedge y$.
Therefore, the logical conjunction of $N_1$ and $N_2$ is as follows:

$N_1 \wedge N_2 \simeq (x \wedge y) \wedge \overline{(x \wedge y)} = 0.$

As  seen  in  the  above  example,  the  logical  reasoning  of  neural  networks  shows  the 
logical  relations  among  neural  networks.  If  the  components  of  logical  vectors  are 
0  or  1,  the  calculation  can  be  done  by  Boolean  algebra,  that  is,  classical  logic. 
However,  even  if  the  training  targets  are  Boolean  functions,  the  training  results 
of  neural  networks  are  not  0  or  1,  but  are  values  like  0.01  or  0.98.  These  numbers 
cannot  be  calculated  by  Boolean  algebra,  but  can  be  calculated  by  LC.  In  the 
above  example,  the  training  targets  are  Boolean  functions  for  simplification. 
However,  any  function  can  be  the  training  target  of  neural  networks  and  any 
trained  neural  network  can  be  reasoned  by  LC.  The  logical  implication  between 
a  neural  network  and  another  neural  network  can  be  calculated  in  a  similar  way 
as  in  the  above  example. 
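The calculation in this section can be sketched as follows. The two logical vectors are the ones reported above; the function names `conj`, `implication` and `to_polynomial` are assumptions introduced for this illustration, and the vector order (1,1), (1,0), (0,1), (0,0) is assumed as before.

```python
from fractions import Fraction as F


def conj(u, v):
    """Logical conjunction in LC: componentwise Min."""
    return tuple(min(a, b) for a, b in zip(u, v))


def implication(u, v):
    """Logical implication in LC: componentwise (1 if a <= b else b)."""
    return tuple(F(1) if a <= b else b for a, b in zip(u, v))


def to_polynomial(v):
    """Expand v11*x*y + v10*x*(1-y) + v01*(1-x)*y + v00*(1-x)*(1-y)
    into the coefficients of (1, x, y, xy)."""
    v11, v10, v01, v00 = v
    return (v00, v10 - v00, v01 - v00, v11 - v10 - v01 + v00)


n1 = (F("0.98"), F("0.01"), F("0.01"), F("0.00"))  # logical vector of N1
n2 = (F("0.02"), F("0.98"), F("0.98"), F("0.99"))  # logical vector of N2

both = conj(n1, n2)
poly = to_polynomial(both)          # coefficients of 1, x, y, xy
n1_implies_n2 = implication(n1, n2)
```

The coefficients in `poly` correspond to 0.01x + 0.01y, the nearly false function obtained above. The last line shows the implication between the two networks computed in the same componentwise way.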

The computational complexity of the logical reasoning is exponential.
Therefore, efficient algorithms are needed, and such algorithms have been
developed [12]. However, due to space limitations, the efficient algorithms
cannot be explained in this paper. They will be explained in another paper.

5  Remarks  on  pattern  reasoning 

Patterns  can  be  regarded  as  functions  and  the  functions  can  be  approximated 
by  neural  networks.  Neural  networks  can  be  reasoned  by  a  few  logics  such  as 
LC,  Lukasiewicz  logic  and  product  logic.  Therefore,  pattern  reasoning  can  be 
realized  by  logical  reasoning  of  neural  networks.  However,  there  are  a  lot  of  open 
problems  for  pattern  reasoning  to  be  applied  to  real  data. 

1.  Computational  complexity 

A basic algorithm is exponential in computational complexity, therefore, a
polynomial algorithm is needed. A polynomial algorithm for a unit in a
neural network has been presented. For networks, an algorithm which uses
only big weight parameters has been presented. The reduction of
computational complexity is included in future work.




2.  Appropriate  logics  for  pattern  reasoning 

Probability calculus is similar to a reasoning for patterns, although it does
not have a formal system. Probability calculus does not satisfy the
contraction rule [7]:

$\dfrac{x \to (x \to y)}{x \to y}$ (contraction).

Therefore, appropriate logics for pattern reasoning should not satisfy the
contraction rule. From this viewpoint, Lukasiewicz logic and product logic,
which do not satisfy the contraction rule, are more appropriate than LC,
which satisfies the contraction rule. It is desired that probability calculus be
formalized logically, but this is very difficult. We are investigating
appropriate logics for pattern reasoning.

3.  Typical  patterns 

There are countless patterns, and some patterns are appropriate for pattern
reasoning, while other patterns are inappropriate. Therefore a dictionary of
patterns is necessary. The patterns included in the dictionary are typical
patterns, which cannot be described linguistically. The typical patterns can be
gathered by various methods, but we do not have to be seriously concerned
with gathering typical patterns, because pattern reasoning is flexible as
explained in the next item. However, gathering typical patterns is important
for efficient pattern reasoning.

4. A difference between pattern reasoning and symbol reasoning in the
reasoning mechanism

In symbol reasoning, when the left side of a rule is not matched, the rule
does not work, while, in pattern reasoning, even when the left side of a rule
is not matched, the rule works. For example, let a rule be $a \to b$ and the
left side of the rule be a'. If a is very similar to a', the truth value of the
rule is almost 1. On the other hand, if a is very different from a', the truth
value of the rule is almost 0. Pattern reasoning works like this, because the
pattern reasoning makes use of continuously valued logics. There are several
other methods which deal with matching degrees of the left sides of rules.
However, the methods are basically arbitrary, whereas the pattern reasoning
presented in this paper includes the matching degrees in the system.

5.  Formal  system 

In  pattern  reasoning,  for  example,  a  question  like  “Is  this  pattern  logically 
deduced  from  the  set  of  rules  of  patterns?”  should  be  answered.  Therefore, 
formal  systems  are  needed  for  pattern  reasoning. 

6.  Incompleteness 

In mathematical logic, completeness is important. In reality, humans cannot
reason or prove true things, that is, humans are incomplete. Therefore,
pattern reasoning should deal with incompleteness.

7.  The  relationship  with  probability  theory 

Probability calculus deals with continuous values, but probability events are
not continuous, that is, the objects of probability theory are not continuous,
while the objects of pattern reasoning are continuous. Therefore, pattern
reasoning can be regarded as an extension of probability calculus.

8. Experimental study

The most typical patterns are images, therefore the final target is the
reasoning of images. We have to begin experiments with simple examples. We have
tried to realize pattern reasoning for one-dimensional data, for example, time
series data, by logical reasoning of neural networks using LC, Lukasiewicz
logic or product logic. The results show that the logical reasoning of neural
networks works well, which will be reported in another paper.
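The remark in item 2 about the contraction rule can be checked with a small sketch. It makes two assumptions that are not in the text: the rule is tested in formula form, that is, whether $(x \to (x \to y)) \to (x \to y)$ always evaluates to 1, and the test runs only on a finite grid of truth values. The implications used are the standard ones of Gödel (LC), Lukasiewicz and product logic on [0,1].

```python
from fractions import Fraction as F
from itertools import product


def godel(x, y):
    """Implication of LC on [0,1]."""
    return F(1) if x <= y else y


def lukasiewicz(x, y):
    """Implication of Lukasiewicz logic."""
    return min(F(1), 1 - x + y)


def goguen(x, y):
    """Implication of product logic."""
    return F(1) if x <= y else y / x


def satisfies_contraction(imp, grid):
    """True if (x -> (x -> y)) -> (x -> y) is 1 for all grid points."""
    return all(
        imp(imp(x, imp(x, y)), imp(x, y)) == 1
        for x, y in product(grid, repeat=2)
    )


grid = [F(k, 8) for k in range(9)]
results = {
    imp.__name__: satisfies_contraction(imp, grid)
    for imp in (godel, lukasiewicz, goguen)
}
```

On this grid only the LC implication passes, for example x = 1/2 and y = 0 gives the value 1/2 under the Lukasiewicz implication. This is consistent with the statement that LC satisfies contraction while Lukasiewicz logic and product logic do not.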

6  Conclusions 

This paper has presented pattern reasoning, which is a new solution for the
knowledge acquisition problem. Knowledge acquisition tried to acquire linguistic
rules from patterns. In contrast, we have tried to modify logics to reason about
patterns. Patterns are represented as functions, which are approximated by neural
networks. Therefore, the logical reasoning of neural networks has been studied.
A few logics can reason about neural networks. This paper has explained the
intermediate logic LC. There are a lot of open problems, therefore the author
strongly encourages the readers to join the research field.

References 

1. D. van Dalen: Intuitionistic Logic, Handbook of Philosophical Logic III,
D. Gabbay and F. Guenthner eds., pp.225-339, D. Reidel, 1984.

2. M. Dummett: A Propositional Calculus with Denumerable Matrix, The Journal
of Symbolic Logic, Vol.24, No.2, pp.97-106, 1959.

3. M. C. Fitting: Intuitionistic Logic, Model Theory and Forcing, North Holland, 1969.

4. P. Hájek: Metamathematics of Fuzzy Logic, Kluwer, 1998.

5. K. Hornik: Multilayer Feedforward Networks are Universal Approximators, Neural
Networks, Vol.2, pp.359-366, 1989.

6. T. Hosoi and H. Ono: Intermediate Propositional Logics (A Survey), Journal of
Tsuda College, Vol.5, pp.67-82, 1973.

7. H. Ono and Y. Komori: Logics without the contraction rule, J. Symbolic Logic 50,
pp.169-201, 1985.

8. H. Tsukimoto and C. Morita: The discovery of propositions in noisy data, Machine
Intelligence 13, pp.143-167, Oxford University Press, 1994.

9. H. Tsukimoto: Continuously Valued Logical Function Satisfying All Axioms
of Classical Logic, Systems and Computers in Japan, Vol.25, No.12, pp.33-41,
Scripta Technica, Inc., 1995.

10. H. Tsukimoto: The space of multi-linear functions as models of logics and its
applications, Proceedings of the 2nd Workshop on Non-Standard Logic and Logical
Aspects of Computer Science, 1995.

11. H. Tsukimoto and C. Morita: Efficient algorithms for inductive learning: An
application of multi-linear functions to inductive learning, Machine Intelligence
14, pp.427-449, Oxford University Press, 1995.

12. H. Tsukimoto: Symbol pattern integration using multilinear functions, in Deep
Fusion of Computational and Symbolic Processing, (eds.) Furuhashi, T., Tano, S.,
and Jacobsen, H.A., Springer Verlag, 1999. (To appear)


Probabilistic  Inference  and  Bayesian  Theorem 
Based  on  Logical  Implication 


Yukari Yamauchi and Masao Mukaidono


Dept. of Computer Science, Meiji University,
Kanagawa, Japan
{yukari, masao}@cs.meiji.ac.jp


Abstract. Probabilistic reasoning is an essential approach of approximate
reasoning to treat uncertain knowledge. Bayes' theorem based
on the interpretation of an If-Then rule as the conditional probability
is widespread in applications of probabilistic reasoning. A new type of
Bayes' theorem based on the interpretation of an If-Then rule as the logical
implication is introduced in this paper, where addition and subtraction
are employed in the probabilistic operations instead of the multiplication
and division employed for the conditional probability of the traditional
Bayes' theorem. Inference based on both interpretations of the If-Then
rule, conditional probability and logical implication, is discussed.


1  Introduction 

In propositional logic, the truth values of propositions are given as either
1 (true) or 0 (false). Inference based on propositional (binary) logic is done
using the inference rule Modus Ponens, shown in Fig. 1. This rule implies that
if an If-Then rule $A \to B$ and proposition A are given as true (1) as
premises, then we come to a conclusion that proposition B is true (1).


$A \to B$
$A$
--------------------
$B$

Fig. 1. Modus Ponens


The inference rule based on propositional logic is extended to probabilistic
inference based on probability theory in order to treat uncertain knowledge.
The truth values of propositions are given as the probabilities of events that
take any value in the range of [0,1]. Here, U is the sample space (universal set),
$A, B \subseteq U$ are events, and the probability of "an event A happens", P(A), is
defined as $P(A) = |A|/|U|$ ($|U| = 1$, $|A| = a \in [0,1]$) under the interpretation
of randomness. Thus the probabilistic inference rule can be written as Fig. 2
adapting the style of Modus Ponens.

$P(A \to B) = i$
$P(A) = a$
--------------------
$P(B) = b \qquad i, a, b \in [0,1]$

Fig. 2. Probabilistic Inference


If the probabilities of $A \to B$ and A are given as 1 (i = a = 1), then b is 1,
since the probabilistic inference should include modus ponens as a special
case. Our focus is to determine the probability of B from the probabilities of
$A \to B$ and A that take any value in [0,1]. $A \to B$ is interpreted as "if A
is true, then B is true" in meta-language. The traditional Bayes' theorem applied
in many probability systems adopts conditional probability as the interpretation
of the If-Then rule. However, the precise interpretation of the symbol $\to$ is
not unique and is still under discussion among many researchers.

E. Trillas and S. Cubillo [1] remarked that the inequality $x \cdot (x \to y) \le y$
(where $x \to y = x^c \vee y$) is valid in an arbitrary Boolean algebra
$(B, \cdot, \vee, {}^c, 0, 1)$, and determined Boolean variants of modus ponens by
replacing conjunction ($\cdot$) and implication ($\to$) by other truth functions.
Nilsson [2] presented a semantical generalization of ordinary first-order logic in
which the truth values of sentences can range between 0 and 1. He established the
foundation of probabilistic logic through a possible-world analysis and
probabilistic entailment. However, in most cases, we are not given the
probabilities for the different sets of possible worlds, but must induce them
from what we are given.

Our goal is to deduce a conclusion and its associated probability from given
rules and facts and their associated probabilities through simple geometric
analysis. The probability of the sentence "if A then B" is interpreted in two ways:
conditional probability and the probability of logical implication. In this paper,
we define the probabilistic inferences based on the two interpretations of the
"If-Then" rule, conditional probability and logical implication, and introduce a
new variant of Bayes' theorem based on the logical implication.


2  Inference  Based  on  Probability  Theory 

2.1  Conditional  Probability 

Conditional probability, "how often B happens when A has already (or necessarily)
happened", only deals with the event space in which A certainly happens. Thus the
sample space changes from U to A.

$P(A \to B) = P(B|A) = |A \cap B| / |A|, \qquad (1)$

$i_c = P(A \cap B) / a. \quad (a \ne 0) \qquad (2)$

Since $P(A \cap B) = i_c \times a$ from Equation (2), the possible size of B is
restricted from $|A \cap B| = i_c \times a$ to $|A^c \cup B| = 1 - (a - a \times i_c)$.
Thus the probabilistic inference based on the interpretation of the if-then rule
as the conditional probability determines P(B) from given $P(A \to B)$ and P(A)
by the following inference style in Fig. 3.


$P(A \to B) = i_c$
$P(A) = a$
--------------------
$P(B) \in [a \times i_c, 1 - a(1 - i_c)]$

[Diagram: sets A and B with $|A \cap B| = a \times i_c$]

Fig. 3. Conditional Probability


Note, P(B) cannot be determined uniquely from $P(A \to B)$ and P(A) and is thus
expressed as the interval probability [3]. When the condition
$a \times i_c = 1 - a(1 - i_c)$ (thus a = 1) holds, P(B) is unique and equal to
$P(A \to B)$.


$P(A \to B) = i_c$
$P(A) = 1$
--------------------
$P(B) \in [1 \times i_c, 1 - 1 + 1 \times i_c] = [i_c, i_c] = i_c$

Fig. 4. Conditional Probability, a = 1


2.2 Logical Implication

The interpretations of $\to$ (implication) in logics such as propositional
(binary or Boolean) logic, multi-valued logic, and fuzzy logic are not unique in
each logic. However, the most common interpretation of $A \to B$ is $A^c \vee B$.

$P(A \to B) = P(A^c \cup B) = |A^c \cup B| / |U|, \qquad (3)$

$i_l = P(A \cap B) + (1 - a). \quad (a + i_l \ge 1) \qquad (4)$

In order to avoid contradiction in premises, the relationship between a and $i_l$
must satisfy the condition $a + i_l \ge 1$.

Since $P(A \cap B) = a - (1 - i_l)$ from Equation (4), the possible size of B is
restricted from $|A \cap B| = a - (1 - i_l)$ to $|A^c \cup B| = i_l$. The
probabilistic inference based on the interpretation of the if-then rule as the
logical implication determines P(B) as the interval probability from given
$P(A \to B)$ and P(A) as shown in Fig. 5.

Similar to the conditional probability case shown in Fig. 4, P(B) is unique
and equal to $P(A \to B)$ when the condition $i_l + a - 1 = i_l$ (thus a = 1) holds.


$P(A \to B) = i_l$
$P(A) = a$
--------------------
$P(B) \in [a - (1 - i_l), i_l]$

[Diagram: sets A and B with $|A \cap B| = a - (1 - i_l)$ and $|A \cap B^c| = 1 - i_l$]

Fig. 5. Logical Implication
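The two inference rules of Fig. 3 and Fig. 5 can be written as small functions. This is a minimal sketch under the definitions of this section; the function names `infer_conditional` and `infer_implication` are introduced here for illustration, and exact rationals are used only to keep the arithmetic free of rounding.

```python
from fractions import Fraction as F


def infer_conditional(a, ic):
    """Fig. 3: bounds on P(B) given P(A) = a and P(A -> B) = P(B|A) = ic."""
    return (a * ic, 1 - a * (1 - ic))


def infer_implication(a, il):
    """Fig. 5: bounds on P(B) given P(A) = a and P(A -> B) = P(A^c u B) = il."""
    if a + il < 1:
        raise ValueError("contradictory premises: a + il >= 1 is required")
    return (a - (1 - il), il)


cond_bounds = infer_conditional(F(4, 5), F(1, 2))   # (2/5, 3/5)
impl_bounds = infer_implication(F(4, 5), F(1, 2))   # (3/10, 1/2)

# With a = 1 both intervals collapse to the single value P(A -> B).
cond_certain = infer_conditional(F(1), F(1, 2))
impl_certain = infer_implication(F(1), F(1, 2))
```

Note that the same number given as $P(A \to B)$ leads to different intervals for P(B) under the two readings, except in the case a = 1.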


3  Bayes’  Theorem 

Bayes' theorem is widespread in application since it is a powerful method to trace
a cause from effects. The relationship between the a priori probability
$P(A \to B)$ and the a posteriori probability $P(B \to A)$ is expressed in the
following equation by eliminating $P(A \cap B)$ from the definitions.

$P(A \to B) = P(B|A) = P(A \cap B) / P(A),$

$P(B \to A) = P(A|B) = P(A \cap B) / P(B),$

$P(B \to A) = P(A) \times P(A \to B) / P(B) \qquad (5)$

Theorem 3.01 The interpretation of $\to$ as the logical implication satisfies the
following equation.


$P(B \to A) = P(A) + P(A \to B) - P(B) \qquad (6)$

Proof. Given $P(A \to B) = i_l$, $P(A) = a$, and $P(B) = b$,

$P(B \to A) = P(B^c \cup A)$

$= 1 - P(B \cap A^c)$

$= 1 - (b - (a - 1 + i_l))$

$= 1 - (b - a + 1 - i_l)$

$= a + i_l - b$

$= P(A) + P(A \to B) - P(B).$

□ 
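Equation (6) can also be checked by brute force. The sketch below assumes a small finite sample space with equally likely outcomes (a universe of four points, chosen arbitrarily) and tests the identity for every pair of events. It is an illustration of the theorem, not a replacement for the proof.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical finite sample space with equally likely outcomes.
U = frozenset(range(4))


def subsets(s):
    """All subsets of s as frozensets."""
    items = sorted(s)
    return [frozenset(c) for r in range(len(items) + 1) for c in combinations(items, r)]


def prob(event):
    return Fraction(len(event), len(U))


def prob_implies(a, b):
    """P(A -> B) read as the probability of the logical implication, P(A^c u B)."""
    return prob((U - a) | b)


# Equation (6): P(B -> A) = P(A) + P(A -> B) - P(B) for all events A and B.
theorem_holds = all(
    prob_implies(b, a) == prob(a) + prob_implies(a, b) - prob(b)
    for a in subsets(U)
    for b in subsets(U)
)
```

The flag is true for all 256 pairs of events, since both sides reduce to $1 + P(A \cap B) - P(B)$.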


[Diagram: sets A and B with regions labeled $1 - i_l$, $a - (1 - i_l)$, and
$|A^c \cap B| = b - (a - 1 + i_l)$]


Fig. 6. Bayes' Theorem with Logical Implication

Note, the new variant of Bayes' theorem based on logical implication
adopts addition + and subtraction - where the traditional one adopts
multiplication × and division /. This property is quite attractive for
operations on multiple-valued domains and for simplicity of calculation. A topic
for further discussion is to apply this new variant of Bayes' theorem to the
systems that employ logical implication.


4  Inference  Applying  Bayes’  Theorem 


4.1  Bayes’  Inference  Based  on  Conditional  Probability 

Now, we apply Bayes' theorem as the inference rule, and define $P(B \to A)$ from
$P(A \to B)$, P(A), and P(B). The inference based on the traditional Bayes'
theorem (conditional probability) is shown in Fig. 7.


$P(A \to B) = i_c$
$P(A) = a$
$P(B) = b$
--------------------
$P(B \to A) = a \times i_c / b$


Fig. 7. Bayes' Inference - Conditional Probability


The condition $\max(a + b - 1, 0)/a \le i_c \le \min(a, b)/a$ must be satisfied
between the probabilities a, b, and $i_c$, since $i_c = P(A \cap B)/a$ and
$\max(a + b - 1, 0) \le P(A \cap B) \le \min(a, b)$.

From $P(A \to B)$ and P(A), P(B) is determined as the interval probability
by the inference rule of Fig. 3 in the previous discussion. Thus $P(B \to A)$
can be determined as follows when P(B) is unknown.


$P(A \to B) = i_c$
$P(A) = a$
--------------------
$P(B) \in [a \times i_c, 1 - a(1 - i_c)]$
$P(B \to A) \in [a \times i_c / (1 - a(1 - i_c)), 1]$

Fig. 8. P(B): unknown


$P(B \to A)$ is unique ($P(B \to A) = 1$) when $a \times i_c = 1 - a(1 - i_c)$,
that is a = 1. Note, in this case $P(B \to A)$ does not depend on $i_c = P(A \to B)$.

$P(A) = 1$
--------------------
$P(B \to A) = 1$

Fig. 9. P(A) = 1


4.2  Bayes’  Inference  Based  on  Logical  Implication 

Similarly, applying the new variant of Bayes' theorem based on logical
implication, we get the following inference rule in Fig. 10.


$P(A \to B) = i_l$
$P(A) = a$
$P(B) = b$
--------------------
$P(B \to A) = a + i_l - b$

Fig. 10. Bayes' Inference - Logical Implication


The condition $\max(b, 1 - a) \le i_l \le 1 - a + b$ must be satisfied between
the probabilities a, b, and $i_l$, since $i_l = P(A \cap B) + (1 - a)$ from
Equation (4) and $\max(a + b - 1, 0) \le P(A \cap B) \le \min(a, b)$.

P(B) is determined from $P(A \to B)$ and P(A) by the inference rule of Fig. 5.
Thus $P(B \to A)$ can be determined as follows when P(B) is unknown. Note,
the result of inference does not depend on the probability $P(A \to B)$. Clearly,
$P(B \to A)$ is unique ($P(B \to A) = 1$) when a = 1.


$P(A \to B) = i_l$
$P(A) = a$
--------------------
$P(B) \in [a + i_l - 1, i_l]$
$P(B \to A) \in [a, 1]$

Fig. 11. P(B) unknown
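The inference rules of Fig. 7 and Fig. 10 can be compared on one concrete case. The sketch below is illustrative: the function names are assumptions, the consistency checks follow the conditions stated in this section, and the extra requirement b > 0 in the conditional case is added here only to avoid division by zero.

```python
from fractions import Fraction as F


def bayes_conditional(a, ic, b):
    """Fig. 7: P(B -> A) = a * ic / b under the conditional reading."""
    if b == 0:
        raise ValueError("P(B) must be positive")
    if not max(a + b - 1, 0) <= a * ic <= min(a, b):
        raise ValueError("inconsistent premises")
    return a * ic / b


def bayes_implication(a, il, b):
    """Fig. 10: P(B -> A) = a + il - b under the implication reading."""
    if not max(b, 1 - a) <= il <= 1 - a + b:
        raise ValueError("inconsistent premises")
    return a + il - b


# One underlying situation: P(A) = P(B) = 1/2 and P(A n B) = 1/4.
a, b, p_ab = F(1, 2), F(1, 2), F(1, 4)
ic = p_ab / a            # conditional probability P(B|A) = 1/2
il = p_ab + (1 - a)      # probability of the implication, Equation (4) = 3/4

posterior_conditional = bayes_conditional(a, ic, b)
posterior_implication = bayes_implication(a, il, b)
```

The first result equals $P(A|B) = P(A \cap B)/P(B)$ and the second equals $P(B^c \cup A) = 1 - P(B \cap A^c)$, obtained with addition and subtraction only.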


5  Generalization  on  Interval  Probability 

Since the results of inferences are given as interval probabilities, we shall
discuss the inference methods when the probabilities of sentences are given as
interval probabilities. Given the set of all interval values $\mathcal{I}$,

$\mathcal{I} = \{[a, b] \mid 0 \le a \le b \le 1\},$

the interval probability of "A happens" is $P(A) \in [a_1, a_2]$.

5.1  Interval  Probabilistic  Inference 

In the previous section, the probabilistic inference based on the conditional
probability determines $P(B) \in [a \times i_c, 1 - a(1 - i_c)]$ from given
$P(A \to B) = i_c$ and P(A) = a. Thus, given $P(A \to B)$ and P(A) as interval
probabilities $[i_{c1}, i_{c2}]$ and $[a_1, a_2]$, the possible probability of
P(B) is at minimum $a_1 \times i_{c1}$ and at maximum $1 - a_1(1 - i_{c2})$.


$P(A \to B) \in [i_{c1}, i_{c2}]$
$P(A) \in [a_1, a_2]$
--------------------
$P(B) \in [a_1 \times i_{c1}, 1 - a_1(1 - i_{c2})]$


Fig. 12. Probabilistic Inference Based on Conditional Probability


Similarly, the probabilistic inference based on the interpretation of the if-then
rule as the logical implication determines P(B) from given
$P(A \to B) \in [i_{l1}, i_{l2}]$ and $P(A) \in [a_1, a_2]$ as shown in Fig. 13.
P(B) is minimum if both $P(A \to B)$ and P(A) take their minimum values $i_{l1}$
and $a_1$. However, in order to avoid contradiction, the condition
$a + i_l \ge 1$ must be satisfied between any combination of the probabilities a
and $i_l$. Thus the minimum value of P(B) is restricted to
$\max(a_1 + i_{l1}, 1) - 1 = \max(a_1 + i_{l1} - 1, 0)$.


$P(A \to B) \in [i_{l1}, i_{l2}]$
$P(A) \in [a_1, a_2]$
--------------------
$P(B) \in [a_1 + i_{l1} - 1, i_{l2}]$


Fig. 13. Probabilistic Inference Based on Logical Implication


Note, in both cases, the result of inference P(B) does not depend on $a_2$
(the maximum value of P(A)).
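The interval versions of the two inference rules (Fig. 12 and Fig. 13) are sketched below. Intervals are represented as pairs (lower, upper), the function names are hypothetical, and the lower bound of the implication case is clipped at 0 as described in the text.

```python
from fractions import Fraction as F


def interval_conditional(a, ic):
    """Fig. 12: a = (a1, a2), ic = (ic1, ic2) -> bounds on P(B)."""
    a1, _ = a                      # the upper bound a2 is not used
    ic1, ic2 = ic
    return (a1 * ic1, 1 - a1 * (1 - ic2))


def interval_implication(a, il):
    """Fig. 13: a = (a1, a2), il = (il1, il2) -> bounds on P(B),
    with the lower bound restricted to max(a1 + il1 - 1, 0)."""
    a1, _ = a                      # the upper bound a2 is not used
    il1, il2 = il
    return (max(a1 + il1 - 1, 0), il2)


b_conditional = interval_conditional((F(1, 2), F(3, 4)), (F(1, 5), F(2, 5)))
b_implication = interval_implication((F(1, 2), F(3, 4)), (F(3, 5), F(4, 5)))

# Changing only a2 leaves both results unchanged.
same_conditional = interval_conditional((F(1, 2), F(1)), (F(1, 5), F(2, 5)))
same_implication = interval_implication((F(1, 2), F(1)), (F(3, 5), F(4, 5)))
```

The last two calls illustrate the note above: only $a_1$ enters the bounds on P(B).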

5.2  Inference  on  Interval  Probability  based  on  Bayes’  Theorem 

Now, we apply Bayes' theorem as the inference rule on interval probabilities.
Given $P(A \to B)$ as the conditional interval probability, $P(B \to A)$ is
determined from $P(A \to B) \in [i_{c1}, i_{c2}]$, $P(A) \in [a_1, a_2]$, and
$P(B) \in [b_1, b_2]$ by Bayes' theorem as shown in Fig. 14.

$P(A \to B) \in [i_{c1}, i_{c2}]$
$P(A) \in [a_1, a_2]$
$P(B) \in [b_1, b_2]$
--------------------
$P(B \to A) \in [a_1 \times i_{c1} / b_2, a_2 \times i_{c2} / b_1]$

Fig. 14. Bayes' Inference - Conditional


Note, the same condition as in section 4.1,
$\max(a + b - 1, 0)/a \le i_c \le \min(a, b)/a$, must be satisfied between any
combinations of the probabilities a, b, and $i_c$.

P(B) is determined as the interval probability from $P(A \to B)$ and P(A) by
the inference rule as shown in Fig. 12 in the previous discussion. Thus
$P(B \to A)$ can be determined as follows when P(B) is unknown.


$P(A \to B) \in [i_{c1}, i_{c2}]$
$P(A) \in [a_1, a_2]$
--------------------
$P(B) \in [a_1 \times i_{c1}, 1 - a_1(1 - i_{c2})]$
$P(B \to A) \in [a_1 \times i_{c1} / (1 - a_1(1 - i_{c1})), 1]$

Fig. 15. Bayes' Inference - Conditional; P(B) unknown

$P(B \to A) = 1$ when $a_1 \times i_{c1} = 1 - a_1(1 - i_{c1})$, that is
$a_1 = 1$. In this case $P(B \to A)$ does not depend on $P(A \to B)$.

Similarly, applying the new variant of Bayes' theorem based on logical
implication on interval probabilities, $P(A \to B)$, P(A) and P(B), we get the
following inference rule in Fig. 16.


$P(A \to B) \in [i_{l1}, i_{l2}]$
$P(A) \in [a_1, a_2]$
$P(B) \in [b_1, b_2]$
--------------------
$P(B \to A) \in [a_1 + i_{l1} - b_2, a_2 + i_{l2} - b_1]$

Fig. 16. Bayes' Inference - Logical Implication

The same condition as in section 4.2, $\max(b, 1 - a) \le i_l \le 1 - a + b$,
must be satisfied between the probabilities a, b, and $i_l$.




P(B) is determined as the interval probability from $P(A \to B)$ and P(A) by
the inference rule as shown in Fig. 13 in the previous discussion. Thus
$P(B \to A)$ can be determined as follows when P(B) is unknown.


$P(A \to B) \in [i_{l1}, i_{l2}]$
$P(A) \in [a_1, a_2]$
--------------------
$P(B) \in [a_1 + i_{l1} - 1, i_{l2}]$
$P(B \to A) \in [a_1, 1]$


Fig. 17. Bayes' Inference - Logical Implication: P(B) unknown


Note, the result of inference does not depend on the probability $P(A \to B)$.

$P(B \to A) = 1$ when $a_1 = 1$.
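The interval forms of the two Bayes' inference rules (Fig. 14 and Fig. 16) can be sketched in the same style. The function names are hypothetical, intervals are pairs (lower, upper), and no consistency check or clipping to [0,1] is performed here, so the inputs are assumed to satisfy the conditions stated above.

```python
from fractions import Fraction as F


def interval_bayes_conditional(a, ic, b):
    """Fig. 14: bounds on P(B -> A) = P(A) * P(A -> B) / P(B)."""
    (a1, a2), (ic1, ic2), (b1, b2) = a, ic, b
    return (a1 * ic1 / b2, a2 * ic2 / b1)


def interval_bayes_implication(a, il, b):
    """Fig. 16: bounds on P(B -> A) = P(A) + P(A -> B) - P(B)."""
    (a1, a2), (il1, il2), (b1, b2) = a, il, b
    return (a1 + il1 - b2, a2 + il2 - b1)


a = (F(2, 5), F(1, 2))
b = (F(1, 2), F(3, 5))

posterior_conditional = interval_bayes_conditional(a, (F(1, 2), F(3, 5)), b)
posterior_implication = interval_bayes_implication(a, (F(7, 10), F(4, 5)), b)
```

In both rules the lower bound pairs the smallest premises with the largest P(B), and the upper bound does the opposite. The implication version needs only additions and subtractions.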

6  Conclusion 

Inference based on probability theory is discussed as a method of approximate
reasoning that treats uncertain knowledge. A new type of Bayes' theorem based
on the interpretation of an If-Then rule as the logical implication is introduced.
The new variant of Bayes' theorem based on logical implication adopts
addition + and subtraction - where the traditional one adopts multiplication ×
and division /. This property is quite attractive in consideration of operations
on multiple-valued domains and simplicity of calculation. An interesting topic for
further discussion is to apply this new variant of Bayes' theorem to
the systems that employ logical implication.


Abstract. This article presents a neural network approach for human
reasoning. It is based on a three-valued Boolean logic. We first lay
down the foundations for the study of a neural logic and represent it by a neural
logic network. We then realize the process of reasoning by the structure
of a neuro model. The nodes represent the functions of reasoning and
the connection weights the parameters of reasoning. The model is close to
realization of the particular application. The goal of this research is to
develop a reasoning system capable of human reasoning based on a neural
logic network.


1  Introduction 

Rule-based systems have been successfully applied in many domains. However,
the current rule-based technology is generally lacking in learning capability and
parallelism. It is also weak in dynamic reasoning, control, and uncertainty
processing. The reasoning proceeds through a predefined tree. Neural logic
networks, on the other hand, lend themselves very well to learning and parallelism
due to their self-organization features. In addition, they are also capable of
incorporating temporal reasoning and certainty factors with ease. In spite of
these capabilities, however, neural logic networks are basically confined to
relatively simple problems. They generally are deficient in the in-depth reasoning
afforded by rule-based systems. In view of the above, one of the goals of current
neural logic network research is to bridge the gap between symbolic and
sub-symbolic approaches with the fusion of hybrid systems. We propose a
computational model that combines the best of both worlds by integrating an
ordinary rule-based system into a neural logic network architecture. The model
contains a neural logic based inference engine that dynamically chains active
rules together into a neural logic network. Parallel execution of rules becomes
possible when active rules are being chained along different paths. The system
further exhibits learning capability by allowing weights to be adjusted during
training sessions.

Logic has traditionally been one of the foundations of the symbolic paradigm. A
neural logic network model that represents propositional truth values as neural
activations and logical operations as connection weights has been proposed in
[4, 2] to represent logic and perform logical inference by the structure and
dynamics of the network respectively. The underlying neural logic network supports
a multitude of logical operations besides the standard operators such as AND,
OR, NOT, NOR and NAND. In particular, the user is free to define any
operation to meet specific needs. The strengths of neural logic thus provide
much greater expressive power for the system's rule syntax than ordinary
rule-based systems. We present a rule inferencing system in which the internal
representation and the inferencing mechanism of propositional rules are driven
by a neural logic network. We base our discussion on a framework for deductive
systems in A.I., namely the logic level, the calculus level, the representation level,
and the control level. We will show that neural logic enriches the meaning of
logic with a DON'T KNOW truth value and human-like logical operators, which are
more appropriate for knowledge processing and decision making.

The paper is organized as follows. The next section introduces
the basics of neural logic networks and gives the definitions of standard and
non-standard logical operators. We then describe human logical reasoning. We
conclude with a discussion of the presented method and look ahead to extensions
and future directions of this work.

2  Neural  Logic  Network 

A neural logic network is a class of artificial neural network which is used to model
human intelligence by computing systems. It can model classical two-valued
Boolean logic effectively. This logic is in fact a good model to study human logic,
which is multivalued, fuzzy and biased. The neural logic network considered in
this work is inspired by [4].

A Neural Logic Network (NLN for short) is a finite directed graph. It contains
a set of input nodes and output nodes. Every node can take one of three
ordered-pair activation values: (1,0) for true, (0,1) for false and (0,0) for unknown.
Every edge in the net is also associated with an ordered-pair weight (t, f), where t
and f are real numbers of positive, negative or zero value.

2.1  Mathematical  definition  of  NLN 

An abstract neural logic network is a mathematical system with the following
features:

• It is a finite directed graph consisting of a set of nodes $N$ and a set of links $E$;

• A non-empty subset $I$ of $N$ is chosen as input nodes. Another non-empty
subset $O$ of $N$ is chosen as output nodes. Other nodes are called hidden
nodes;

• An algebraic system $\langle R, +, \times \rangle$ satisfying the axioms of a ring.

An association of the set of links to the ring $R$ is defined by a mapping $\varphi_1$:

$$\varphi_1 : E \to R$$

That is to say, every link of the directed graph is given a value from the
chosen ring $\langle R, +, \times \rangle$;

• A subset $A$ is chosen from the ring $R$, together with a specially chosen
mapping $\varphi_2$ from the set of all non-input nodes (i.e. $N \setminus I$) to the set $A$:

$$\varphi_2 : (N \setminus I) \to A$$

That is to say, every non-input node of $N$ is given a value in $A$. The elements of
$A$ are called truth-values;
Figure 1: a) 3-valued NLN, b) Boolean NLN

• A mapping from $R$ to $A$, i.e.

$$f : R \to A$$

called the threshold function.

An abstract neural logic network can now be denoted by

$$Net = \langle N, E, I, O, R, A, \varphi_1, \varphi_2, f \rangle$$


By changing the ring $R$ and the threshold function $f$, different sub-classes of neural logic
network can be obtained. Three-valued NLN, Boolean NLN and Fuzzy NLN are
the three main sub-classes of NLN. For instance, the 3-valued NLN can be obtained
by representing the value of input $P_i$ ($i = 1, \ldots, n$) by ordered pairs $(x_i, y_i)$ with
the weights $(a_i, b_i)$ and the value of output $Q$ by an ordered pair $(x, y)$. Let
$A = A_3 = \{(1,0), (0,1), (0,0)\}$ (where $A_3$ means the truth value set in
3-valued NLN) be the truth value set, where $(1,0)$, $(0,1)$, $(0,0)$ represent true, false
and don't know respectively, and let $a_i$, $b_i$ be any real numbers. The threshold function
can then be defined as

$$(x, y) = \begin{cases} (1,0) & \text{if } \sum_{i=1}^{n} a_i x_i - \sum_{i=1}^{n} b_i y_i \ge 1 \\ (0,1) & \text{if } \sum_{i=1}^{n} a_i x_i - \sum_{i=1}^{n} b_i y_i \le -1 \\ (0,0) & \text{otherwise} \end{cases}$$

Fig. 1 a) shows the general structure of a single 3-valued NLN.

Boolean NLN is the simplest type of NLN. Its theory plays a special role
because of its link to other well-known neural networks such as the multi-layer
perceptron [3], Kohonen nets [1], etc. In a Boolean NLN, single real numbers $a_i$ are
used for the weights of the links, and Boolean numbers 1 or 0 are used for the values of
the inputs $P_i$ ($i = 1, \ldots, n$) and the output $Q$. The truth value set is then denoted by
$A_B$ (where $B$ stands for Boolean NLN), with $A_B = \{0, 1\}$. The threshold function
is defined as

$$f(x) = \begin{cases} 1 & \text{if } x = \sum_{i=1}^{n} a_i P_i \ge 1 \\ 0 & \text{if } x = \sum_{i=1}^{n} a_i P_i < 1 \end{cases}$$

Suppose we are given a Boolean neural logic network and suppose $Q$ is one of
its nodes with incoming links, as in Fig. 1 b).

To find the value of $Q$ we need to find the current values of the nodes $P_1, P_2, P_3$,
say $o_1, o_2, o_3$ respectively. Then we compute the sum $x = 2o_1 - o_2 + o_3$ and put this
number $x$ into the threshold function $f(x)$ to decide whether $Q$ should be 1 or 0.
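The computation just described can be illustrated with a minimal Python sketch. The weights 2, -1 and 1 are read off the sum in the text; the function name `boolean_node` and the non-strict comparison against the threshold 1 are assumptions made for this illustration.

```python
from typing import Sequence


def boolean_node(weights: Sequence[float], inputs: Sequence[int],
                 threshold: float = 1.0) -> int:
    """Evaluate one Boolean NLN node: weighted sum followed by a threshold."""
    if len(weights) != len(inputs):
        raise ValueError("one weight is needed per input")
    x = sum(w * p for w, p in zip(weights, inputs))
    # Assumption: the node fires when the sum reaches the threshold (>=).
    return 1 if x >= threshold else 0


# Weights taken from the sum x = 2*o1 - o2 + o3 used for Fig. 1 b).
FIG_1B_WEIGHTS = (2.0, -1.0, 1.0)

if __name__ == "__main__":
    for values in [(1, 1, 0), (0, 1, 1), (0, 0, 1)]:
        print(values, "->", boolean_node(FIG_1B_WEIGHTS, values))
```

With these weights, the first input can outvote the second, which acts as an inhibitor.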

The choice of weights associated with an NLN offers a great variety of different
logic operations. In theory, for a network with two inputs, a total of $3^9$ distinct
meaningful binary logical operations are possible. The definitions of AND, OR
and NOT are as follows:
• An AND operation of $n$ inputs is defined as a neural logic function, written
as $AND(P_1, P_2, \ldots, P_n)$ or $Q = P_1$ AND $P_2$ AND $\ldots$ AND $P_n$, with the weights
$(a_i, b_i) = (1/n, n)$ for $i = 1, 2, \ldots, n$;

• An OR operation of $n$ inputs is defined as a neural logic function, written as $OR(P_1, P_2, \ldots, P_n)$
or $Q = P_1$ OR $P_2$ OR $\ldots$ OR $P_n$, with the weights $(a_i, b_i) = (n, 1/n)$ for
$i = 1, 2, \ldots, n$;

• A NOT operation of 1 input is defined as a neural logic function, written
as $NOT(P)$ or $Q = NOT(P)$, with the weight $(a, b) = (-1, -1)$.

Fig. 2 shows several useful operations in 3-valued NLN.
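As an illustration of these definitions, the following Python sketch evaluates a single 3-valued node. It assumes the AND weights $(1/n, n)$ and OR weights $(n, 1/n)$, which agree with the $(1/2, 2)$ label on the two-input AND in Fig. 2, and a non-strict comparison against the threshold 1. The function names are hypothetical, and exact fractions are used to avoid rounding at the threshold.

```python
from fractions import Fraction
from typing import Sequence, Tuple

Truth = Tuple[int, int]
TRUE: Truth = (1, 0)
FALSE: Truth = (0, 1)
UNKNOWN: Truth = (0, 0)


def activate(inputs: Sequence[Truth], weights, threshold=1) -> Truth:
    """Threshold the net input sum(a_i * x_i - b_i * y_i) into a truth value."""
    net = sum(a * x - b * y for (x, y), (a, b) in zip(inputs, weights))
    if net >= threshold:
        return TRUE
    if net <= -threshold:
        return FALSE
    return UNKNOWN


def nl_and(*inputs: Truth) -> Truth:
    n = len(inputs)
    return activate(inputs, [(Fraction(1, n), n)] * n)  # assumed weights (1/n, n)


def nl_or(*inputs: Truth) -> Truth:
    n = len(inputs)
    return activate(inputs, [(n, Fraction(1, n))] * n)  # assumed weights (n, 1/n)


def nl_not(p: Truth) -> Truth:
    return activate([p], [(-1, -1)])


if __name__ == "__main__":
    names = {TRUE: "T", FALSE: "F", UNKNOWN: "?"}
    for p in (TRUE, FALSE, UNKNOWN):
        for q in (TRUE, FALSE, UNKNOWN):
            print(names[p], names[q],
                  "AND:", names[nl_and(p, q)], "OR:", names[nl_or(p, q)])
```

Under these assumptions, a false input decides an AND even when the other input is unknown, and a true input decides an OR in the same way.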

3  Logical  Reasoning 

In Section 2 we introduced the neural logic model and its capability to
incorporate the local inference of Boolean logic. The interconnection of such
models is called a Neural Logic Network (NLN for short). In this section we introduce
a rule inferencing system based on the neural logic model for a propositional knowledge
base.

A proposition is represented as a neural logic neuron labelled Q. The truth
value of Q, denoted t(Q), is given by the neuron's activation. The truth values
true, false, and don't-know are denoted by the ordered pairs (1, 0), (0, 1) and (0,
0) respectively. The connection weight from a neuron denoting proposition P to the
neuron denoting proposition Q is also extended to an ordered pair (x, y), where x and y
are real numbers that can be viewed as the truth and false values, or as the strengths
of the support and opposition, respectively, of proposition P for proposition Q.

Definition Given propositions $P_1, P_2, \ldots, P_n$ with truth values
$(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$, which are connected to proposition $Q$ with the weights
$(a_1, b_1), (a_2, b_2), \ldots, (a_n, b_n)$ respectively, the $Netinput(Q)$ is defined as

$$Netinput(Q) = \sum_{i=1}^{n} (a_i x_i - b_i y_i).$$

The activation of the neuron is defined as follows:

$$t(Q) = \begin{cases} (1,0) & \text{if } Netinput(Q) \ge \lambda \\ (0,1) & \text{if } Netinput(Q) \le -\lambda \\ (0,0) & \text{otherwise} \end{cases}$$

where $\lambda$ is the threshold, usually set to 1. The $P_i$'s and $Q$ in Fig. 1 are referred
to as input nodes and output node respectively.

3.1  Neural  Logic  Element 

A Neural Logic Element (Netelm for short) can be seen as a one-layer, or at most
two-layer, neural network with n input nodes and one output node, and an
optional layer of hidden nodes. With reference to Fig. 1, the definition of a
neural logic element can be given as follows.

Definition A neural logic element of $n$ inputs with the propositions
$P_1, P_2, \ldots, P_n$ connected to proposition $Q$ is defined as

$$Netelm : \{(1,0),(0,1),(0,0)\}^n \to \{(1,0),(0,1),(0,0)\}$$

$$t(Q) = \begin{cases} (1,0) & \text{if } Netinput(Q) \ge 1 \\ (0,1) & \text{if } Netinput(Q) \le -1 \\ (0,0) & \text{otherwise} \end{cases}$$


The OR operation, Q = P1 ∨ P2:

P1      P2      P1 ∨ P2
(1,0)   (1,0)   (1,0)
(1,0)   (0,1)   (1,0)
(1,0)   (0,0)   (1,0)
(0,1)   (1,0)   (1,0)
(0,1)   (0,1)   (0,1)
(0,1)   (0,0)   (0,0)
(0,0)   (1,0)   (1,0)
(0,0)   (0,1)   (0,0)
(0,0)   (0,0)   (0,0)

The AND operation, Q = P1 ∧ P2, with edge weights (1/2, 2):

P1      P2      P1 ∧ P2
(1,0)   (1,0)   (1,0)
(1,0)   (0,1)   (0,1)
(1,0)   (0,0)   (0,0)
(0,1)   (1,0)   (0,1)
(0,1)   (0,1)   (0,1)
(0,1)   (0,0)   (0,1)
(0,0)   (1,0)   (0,0)
(0,0)   (0,1)   (0,1)
(0,0)   (0,0)   (0,0)

The NOT operation, Q = ¬P:

P       Q
(1,0)   (0,1)
(0,1)   (1,0)
(0,0)   (0,0)

The At-Least-k operation on P1, P2, ..., Pn:

Q = (1,0) iff at least k of P1, ..., Pn are (1,0)
Q = (0,1) iff all of P1, ..., Pn are (0,1)
Q = (0,0) otherwise

An operation comparing the numbers of (1,0) and (0,1) inputs:

Q = (1,0) iff there are k more (1,0) than (0,1) among the values of P1, P2, ..., Pn
Q = (0,1) iff there are k more (0,1) than (1,0) among the values of P1, P2, ..., Pn
Q = (0,0) otherwise

An operation that takes the first input that is not (0,0):

Q = P1 if P1 ≠ (0,0)
Q = P2 if P1 = (0,0) ∧ P2 ≠ (0,0)
Q = P3 if P1 = P2 = (0,0) ∧ P3 ≠ (0,0)

Figure 2: Operations in 3-valued neural logic
[Figure 3: example Netelms. Panel a) links richer(x, y) and stronger(x, y) to
better(x, y), each with weight (1/2, 2); the surviving caption fragment also
mentions a student example.]


Every Netelm has two equivalent forms: one is the usual textual form, similar
to that in a conventional rule-based system, and the other is a graphical form
that pictorially represents the network element. The following example illustrates
the equivalence between a rule in a rule-based system and a Netelm.

For the rule: if richer(x, y) AND stronger(x, y) then better(x, y), the equivalent
Netelm is shown in Fig. 3 a). The weights attached to the edges correspond to
the AND connective in the rule.

Fig. 3 b) shows an example of how usefully and flexibly a neural element
can be used to model a human inference or decision pattern. In this example the
priorities of the $P_i$'s views are encoded in their corresponding weights:

When $P_1$ gives his view, his view is the outcome;

If $P_1$ withholds his view, i.e. $P_1 = (0,0)$, then $P_2$'s view will be the outcome.

$P_3$'s view will be the outcome only when both $P_1$ and $P_2$ withhold their views.
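The priority pattern just described can be reproduced with a single neural logic element. The sketch below is only an illustration: the weights of Fig. 3 b) are not legible in the source, so the weights $(4,4)$, $(2,2)$, $(1,1)$ (in general $(2^{n-1-i}, 2^{n-1-i})$ for the $i$-th of $n$ inputs) are an assumption chosen so that each view outweighs all lower-priority views combined. The function names are hypothetical.

```python
TRUE, FALSE, UNKNOWN = (1, 0), (0, 1), (0, 0)


def activate(inputs, weights, threshold=1):
    """Threshold the net input sum(a_i * x_i - b_i * y_i) into a truth value."""
    net = sum(a * x - b * y for (x, y), (a, b) in zip(inputs, weights))
    if net >= threshold:
        return TRUE
    if net <= -threshold:
        return FALSE
    return UNKNOWN


def priority(*views):
    """Return the first view that is not UNKNOWN, using one weighted node.

    Assumed weights: input i of n gets (2**(n-1-i), 2**(n-1-i)), so its vote
    is larger than the sum of all later votes.
    """
    n = len(views)
    weights = [(2 ** (n - 1 - i), 2 ** (n - 1 - i)) for i in range(n)]
    return activate(views, weights)


if __name__ == "__main__":
    print(priority(UNKNOWN, FALSE, TRUE))  # P1 withholds, so P2's view is used
```

Because every weight exceeds the sum of the weights that follow it, the sign of the net input is always decided by the highest-priority view that is not withheld.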
A Netelm in Fig. 3 c) represents the following rule:
if Better-Than(Academic-grade(x), A-minus)

AND Min-Percentage(class-attendance(x), 95)
then Excellent-Student(x)

3.2  Neural  logic  program 

A neural logic program is the formal representation of a neural logic network.
We use Horn clauses for the representation of facts and rules in a knowledge
base expressed in terms of neural logic. Horn clauses can easily be transformed
into a Prolog program.

Definition A fact clause is of the form: $Q$, where $Q$ is a symbol and $t(Q)$
denotes the truth value of $Q$. By default $t(Q)$ is (1, 0). This allows compatibility
with standard Prolog syntax.

Definition A rule clause is of the form:

$Q$ :- $A(P_1, P_2, \ldots, P_n)$ or $Q$ :- $P_1 w_1, P_2 w_2, \ldots, P_n w_n$
where $Q$ and the $P_i$'s are symbols for propositions, $A$ is a neural logic function
(e.g. AND, OR, etc.) and $w_i$ is of the form $\langle x_i, y_i \rangle$, denoting the weight
$(x_i, y_i)$ from $P_i$ to $Q$ for $i = 1, 2, \ldots, n$. When $A$ is a null string, it means an AND
neural element. If some propositional symbol, such as $Q$, appears as the head
of more than one clause, the $Q$'s are interpreted as the input nodes to an OR
neural element. These again allow compatibility with standard Prolog syntax.
The $w_i$ in the formula is for specifying an arbitrary weight to define any neural logic
function. The neural logic network representation for a rule clause will be the
same as the one in Fig. 1.
Figure 4: Neural logic network program for a) a fragment of clauses, b) an example

Definition A program clause is either a fact clause or a rule clause. A
neural logic program is a collection of program clauses. It is represented as a
forest of neural logic net trees, each consisting of a number of neural elements joined
as follows: the output node of a neural element becomes an input node of another
neural element if they both denote the same proposition. The truth values of the
propositions as specified by the fact clauses are the activations attached beside the
relevant neurons. Thus, every neural logic program has a unique neural logic
network representation, and the number of distinct symbols (propositions) in the
program equals the number of nodes in the neural logic network minus the number of
hidden nodes in the neural elements.

To further illustrate the implicit AND and OR operators in a standard Prolog
program and its corresponding neural logic network representation, consider the
following program fragment:

$Q_1$ :- $P_{11}, P_{12}, \ldots, P_{1n}$

$Q_2$ :- $P_{21}, P_{22}, \ldots, P_{2n}$

$\ldots$

$Q_m$ :- $P_{m1}, P_{m2}, \ldots, P_{mn}$

where $Q_1, Q_2, \ldots, Q_m$ refer to the same syntactical symbol $Q$. Then $m$ hidden
nodes, $Q_1, Q_2, \ldots, Q_m$, will be created, each of which is an output from an AND
neural element of $n$ inputs ($P_{i1}, P_{i2}, \ldots, P_{in}$) as well as an input node to an OR
neural element of $m$ inputs, as shown in Fig. 4 a).

Fig. 4 b) shows a simple neural logic program in standard Prolog syntax
and its corresponding neural logic network representation.
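The mapping from clauses to a network can be sketched in Python as follows. Each clause body is evaluated by an AND element, and several clauses with the same head feed an OR element, as described above. The data layout (`rules` as a dictionary from a head to a list of bodies), the example program, the treatment of undefined symbols as don't-know, and the AND and OR weights $(1/n, n)$ and $(n, 1/n)$ are assumptions of this sketch.

```python
from fractions import Fraction

TRUE, FALSE, UNKNOWN = (1, 0), (0, 1), (0, 0)


def activate(inputs, weights, threshold=1):
    net = sum(a * x - b * y for (x, y), (a, b) in zip(inputs, weights))
    if net >= threshold:
        return TRUE
    if net <= -threshold:
        return FALSE
    return UNKNOWN


def nl_and(inputs):
    n = len(inputs)
    return activate(inputs, [(Fraction(1, n), n)] * n)


def nl_or(inputs):
    n = len(inputs)
    return activate(inputs, [(n, Fraction(1, n))] * n)


def evaluate(symbol, rules, facts):
    """Truth value of a proposition in an acyclic neural logic program."""
    if symbol in facts:
        return facts[symbol]
    bodies = rules.get(symbol)
    if not bodies:
        return UNKNOWN  # assumption: undefined symbols are don't-know
    # One hidden AND node per clause with this head.
    hidden = [nl_and([evaluate(p, rules, facts) for p in body]) for body in bodies]
    # Several clauses with the same head are joined by an OR element.
    return hidden[0] if len(hidden) == 1 else nl_or(hidden)


# Hypothetical program:  q :- a, b.   q :- c.
RULES = {"q": [["a", "b"], ["c"]]}

if __name__ == "__main__":
    print(evaluate("q", RULES, {"a": TRUE, "b": TRUE, "c": FALSE}))
```

The recursion mirrors backward chaining: only the fragments of the network needed for the queried symbol are visited.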

3.3  The  proposed  system 

A schematic diagram of the system is shown in Fig. 5. The system allows
conversion from if ... then ... rules into a neural logic program. Every rule is represented
in the knowledge base by a netelm. Standard logic operations in conventional rules
are readily transformed into netelms with fixed weights assigned for the
corresponding logic operators.

The Rule Editor and Query Manager combine two forms of user interaction
with the system. Besides providing a friendly environment for the user to convert,
create and maintain the netelm knowledge base, they also derive the conclusion.

The netelm rule base is the repository of knowledge to be used in the
inference process. It is made up of rules from three main sources: conventional
rules transformed into netelms by the rule editor; rules added or modified by
the users; and meta rules learned by the inference engine through consultations.

Figure 5: Architecture of the reasoning system

The Inference Engine dynamically links up relevant rules from the netelm
rule base in the process of arriving at a conclusion for the user. The inference
engine may run in consultation or training mode. In either case it chains active
rules together into a neural logic network with a tree-like structure. With the netelm
schema discussed earlier, the tree structure is essentially made of fragments of neural
logic network to be fitted together dynamically during the inferencing process,
similar to the chaining of rules in the working memory of a conventional
rule-based system.

3.4  Learning 

Since all logical operations in the system are represented by arc weights, this
provides a means of modifying the rules by adjusting the weights. This is especially
useful for rules that involve non-standard logic operations such as those of human
logic. In such cases, weights may initially be set on the basis of some intuition
and later tuned by training with examples.

The system may be trained in two ways. First, the learning mechanism allows
alteration of the logical operation of a rule by training the corresponding body of
connected netelms with known exemplary deductions. Second, the learning
mechanism allows fine tuning of weights without altering the basic logical operations
defined in the netelms. In the dynamic linking of netelms, either forward or
backward, paths that fail to achieve the desired conclusions will have their weights
decreased. Other paths that confirm the desired conclusions may have their
weights increased. In training mode, when a training example is presented to the
system, it chains up an inference tree from the input nodes to the desired goal
node. Relevant netelm rules are selected from the knowledge base depending on
the input variables in the training example. After activation and
propagation, the activation state of the network output node is compared to the desired
value in the training example. If the network is not able to derive the desired
conclusion, the error is back-propagated and the netelms in the inference tree
have their connection weights adjusted. However, the system assumes that
the netelm rules articulated by the human experts are sufficiently close to the
global minimum of the neural network representing the domain knowledge. The
constituent netelms therefore require only small weight adjustments; for instance, the
output nodes of certain netelms in the inference network may fall slightly short of the
threshold of the neural logic network activation. This means that the iterative
process of error back-propagation will need to occur only a small number of
times.
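The fine-tuning case, where an expert-supplied netelm falls slightly short of the threshold, can be illustrated with a small Python sketch. The paper does not specify its update rule, so the perceptron-style correction used here (nudging the weights of the active inputs in the direction of the desired output) is a substitute chosen for illustration, not the authors' back-propagation procedure. The function `tune`, the learning rate and the starting weights are all hypothetical.

```python
TRUE, FALSE, UNKNOWN = (1, 0), (0, 1), (0, 0)


def activate(inputs, weights, threshold=1):
    net = sum(a * x - b * y for (x, y), (a, b) in zip(inputs, weights))
    if net >= threshold:
        return TRUE
    if net <= -threshold:
        return FALSE
    return UNKNOWN


def tune(weights, inputs, target, rate=0.1, max_steps=100):
    """Adjust weights in small steps until the netelm outputs the target."""
    weights = [list(w) for w in weights]
    want = target[0] - target[1]          # +1 true, -1 false, 0 don't-know
    for _ in range(max_steps):
        out = activate(inputs, weights)
        if out == target:
            break
        have = out[0] - out[1]
        direction = 1 if want > have else -1
        for w, (x, y) in zip(weights, inputs):
            w[0] += rate * direction * x  # support weight of true inputs
            w[1] -= rate * direction * y  # opposition weight of false inputs
    return [tuple(w) for w in weights]


# A two-input AND whose support weights were set slightly too low.
START = [(0.4, 2.0), (0.4, 2.0)]
TUNED = tune(START, [TRUE, TRUE], TRUE)

if __name__ == "__main__":
    print(activate([TRUE, TRUE], START), "->", activate([TRUE, TRUE], TUNED))
```

Only the weights on paths that carried a non-zero activation are changed, which matches the idea that a near-correct rule needs just a few small corrections.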

4  Conclusion 

We have described a rule inferencing system based on the neural logic model. The
model provides a richer set of logical operations that are close to human
reasoning and decision making, which is difficult if not impossible to model with
classical logic. We define a neural logic program for the representation of a specific
application. We also suggested a search strategy with heuristic search and an
adaptive strategy for standard operators. In rule-based systems, by comparison,
the knowledge is usually constructed in a hierarchy. However, the predefined
generalization hierarchy limits the system's flexibility: it is difficult to update those
assumptions that are no longer significant. In this approach, when a query is
submitted to the system, it is mapped dynamically to a neural network. In
doing so, the topology of the network is reduced to the size of the query.

Furthermore, conventional rule-based systems often lack learning ability.
Finally, the system is made more resilient to the brittleness problem of
conventional rule-based systems, which can fail abruptly in the face of fuzzy data.

From the above discussion and examples, it is not difficult to envisage the
power of a neural logic model consisting of a chain of neural logic elements to
represent interesting and realistic human logic, which is not possible in classical logic.
We are at the beginning of the project, and therefore cannot yet report on a
real-world application. This is a subject for future work. We will explore the
limitations of this approach on a number of domains, and we hope to show that this
idea is extendible to many other AI problems.
The Resolution for Rough Propositional Logic

Abstract

Based on the first-order rough logic studied by Lin and Liu, this paper
establishes a rough propositional logical system with rough lower (L) and
upper (H) approximate operators. It discusses the resolution principle
in the system. The soundness of resolution deduction, and the soundness and
completeness of refutation, are also studied in the paper.
Keywords: Rough Propositional Logic, Resolution Principle, Resolution
Deduction, Soundness Theorem, Refutation.

1  Introduction 

Lin and Liu studied a first-order rough logic based on six topological properties,
in particular using the axioms of Kuratowski's closure (H) and interior (L)
operators. Thus, the first-order rough logic system with operators L and H was
developed. The revision studied further by Lin and Liu considers a map $f$,
defined as a one-to-one map between the boundary-line region in the logic
and the undefinable region in classical logic. Hence, the logic is proved to be sound
and complete with the new interpretation in the revision.

We bear in mind the idea of studying resolution reasoning. This paper
first establishes a rough propositional logical system (RPLS) with operators (L)
and (H), which are defined as rough lower and upper approximate operators.
Next, the paper describes the resolution principle and the soundness of
resolution deduction in the logic. Hence, this paper differs from other work: it
focuses on resolution reasoning rather than on logical systems based on the rough
concept.

2  Rough  Propositional  Logical  System 

Let $w$ be a rough propositional formula. We call $m(w) \subseteq U$ the meaning
of $w$. The meaning sets of the formulas $Lw$ and $Hw$ with rough lower and upper
approximate operators are defined as follows:

(1). $x$ satisfies $w$ iff $x \in m(w)$;

(2). $x$ satisfies $Lw$ iff $\forall y\,(y \in H(x) \to y \in m(w))$, i.e., $H(x) \subseteq m(w)$;

(3). $x$ satisfies $Hw$ iff $\exists y\,(y \in H(x) \land y \in m(w))$, i.e., $H(x) \cap m(w) \ne \emptyset$.

Here $H(x)$ is the equivalence class containing $x$.

* This study is supported by the National Natural Science Fund and the JiangXi Province
Natural Science Fund in China.
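The three satisfaction conditions can be illustrated with a short Python sketch that computes the meaning sets of $Lw$ and $Hw$ from the meaning set of $w$ and a partition of the universe into equivalence classes. The function names, the universe, the partition and the meaning set below are hypothetical examples.

```python
def equivalence_class(x, partition):
    """Return H(x), the block of the partition that contains x."""
    for block in partition:
        if x in block:
            return block
    raise ValueError(f"{x!r} is not covered by the partition")


def meaning_L(meaning, universe, partition):
    """m(Lw): the elements whose whole equivalence class lies inside m(w)."""
    return {x for x in universe if equivalence_class(x, partition) <= meaning}


def meaning_H(meaning, universe, partition):
    """m(Hw): the elements whose equivalence class meets m(w)."""
    return {x for x in universe if equivalence_class(x, partition) & meaning}


# Hypothetical universe, equivalence classes and meaning set m(w).
UNIVERSE = {1, 2, 3, 4, 5, 6}
PARTITION = [frozenset({1, 2}), frozenset({3, 4}), frozenset({5, 6})]
M_W = {1, 2, 3}

if __name__ == "__main__":
    print(sorted(meaning_L(M_W, UNIVERSE, PARTITION)))
    print(sorted(meaning_H(M_W, UNIVERSE, PARTITION)))
```

In this example the class {3, 4} only partly overlaps $m(w)$, so its elements satisfy $Hw$ but not $Lw$; they form the boundary region.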

Well-formed formulas (Wffs):

(1). All atomic formulas are Wffs;

(2). If $w$ and $w_1$ are Wffs, then so are $\sim w$, $w \to w_1$ and $Lw$;

(3). The only Wffs are those obtainable by finite applications of (1) - (2)
above.

Other logical connectives $\lor$, $\land$, $\leftrightarrow$ are defined by $\sim$ and $\to$, and operator $H$ is
defined by $\sim$ and $L$.
Axiom schemas

A1. $\vdash w \to (w_1 \to w)$;

A2. $\vdash (w \to (w_1 \to w_2)) \to ((w \to w_1) \to (w \to w_2))$;

A3. $\vdash (\sim w_1 \to \sim w_2) \to (w_2 \to w_1)$;

A4. $\vdash L(w_1 \to w_2) \to (Lw_1 \to Lw_2)$;

A5. $\vdash Lw \to w$;

A^.  H Lw  —*  tw- 

Rules of inference

R1. Modus Ponens (MP): From $\vdash w_1 \to w_2$ and $\vdash w_1$, we have $\vdash w_2$;

R2. L insertion (LI): From $\vdash w$, we have $\vdash Lw$.

Here R2 means that $Lw$ is valid if $w$ is valid for all observable worlds.

The semantics of the logic

The semantic model of formulas in RPLS is defined as a triple:

$$M = \langle W, R, m \rangle$$

where $W$ is a non-empty set of observable worlds; if each observable world
is viewed as a state, then $W$ is a state set. $R$ is a binary relation on $W$
such that $\forall s \in W$, $\exists s' \in W$, $(s, s') \in R$; $m$ is a meaning function that assigns to each
propositional variable $p$ a subset $m(p)$ of $W$.

Given a model $M$, we say that a formula $w$ is satisfied by a state $s$ in model $M$,
written $M \models_s w$, iff the following conditions are satisfied:

(1). $M \models_s p$ iff $s \in m(p)$, where $p$ is a propositional variable;

(2). $M \models_s \sim w$ iff $M \not\models_s w$;

(3). $M \models_s w_1 \lor w_2$ iff $M \models_s w_1$ or $M \models_s w_2$;

(4). $M \models_s w_1 \land w_2$ iff $M \models_s w_1$ and $M \models_s w_2$;

(5). $M \models_s w_1 \to w_2$ iff $M \models_s \sim w_1 \lor w_2$;

(6). $M \models_s w_1 \leftrightarrow w_2$ iff $M \models_s (w_1 \to w_2) \land (w_2 \to w_1)$;

(7). $M \models_s Lw$ iff $\forall s' \in W$, if $(s, s') \in R$ then $M \models_{s'} w$;

(8). $M \models_s Hw$ iff $\exists s' \in W$, $(s, s') \in R \land M \models_{s'} w$.

Given a model $M$, each formula $w$ in RPLS is assigned a set of
states in $M$, denoted by $m(w) = \{s \in W : M \models_s w\}$.

We introduce truth and validity of formulas. A formula $w$ is true in a model
$M$ iff $m(w) = W$; a formula $w$ is valid in RPLS iff it is true in every model in
RPLS; a formula $w$ is satisfiable iff for some model $M$ and state $s$, $M \models_s w$. If
$w$ includes operators $L$ and $H$, the description is also valid by (7) and (8).
3 Conjunctive Normal Form (CNF)

Let $w$ be a formula in RPLS; then there is a CNF corresponding to $w$:

$$C_1 \land C_2 \land \ldots \land C_n$$

where $n \ge 1$ and each clause $C_i$ is a disjunction of the general form:

$$C_i = p_1 \lor \ldots \lor p_{n_1} \lor Lq_1 \lor \ldots \lor Lq_{n_2} \lor Hr_1 \lor \ldots \lor Hr_{n_3}$$

where each $p_i$ is a literal; each $q_j$ is a disjunction that possesses the general form
of the clauses; and each $r_k$ is a conjunction, where each conjunct possesses the general
form of the clauses.

For example, the following formulas are in conjunctive normal form:

(1) .  L(p  V  g  V  A  t)); 

(2). $H((p \lor q) \land \sim p)$;

(3). $\sim p \lor p \lor L(r \lor s) \lor H((p \lor Lr) \land e)$;

(4). $\sim p \lor L(Lp \lor H(q \land Lr)) \lor H(L(H((Lq \lor Ht) \land r) \lor Lt) \land p)$.

Any formula $w$ in RPLS can be transformed equivalently into conjunctive normal
form. For example, let $w = L(p \land H(q \lor L(r \land t))) \land (p \to L(q \land Ht))$.

The following steps transform it into CNF:

(1). $Lp \land LH(q \lor (Lr \land Lt)) \land (p \to L(q \land Ht))$;

(2). $Lp \land LH((q \lor Lr) \land (q \lor Lt)) \land (p \to L(q \land Ht))$;

(3). $Lp \land LH((q \lor Lr) \land (q \lor Lt)) \land (\sim p \lor L(q \land Ht))$;

(4). $Lp \land LH((q \lor Lr) \land (q \lor Lt)) \land (\sim p \lor (Lq \land LHt))$;

(5). $Lp \land LH((q \lor Lr) \land (q \lor Lt)) \land (\sim p \lor Lq) \land (\sim p \lor LHt)$.

Here each conjunct has the general form of a clause.

4  The  Resolutions  in  the  RPLS 

For any two clauses $C_1$ and $C_2$, if there is a literal $p_1$ in $C_1$ that is complementary
to a literal $p_2$ in $C_2$, then delete $p_1$ and $p_2$ from $C_1$ and $C_2$ respectively, and
construct the disjunction of the remaining clauses. Therefore, we have the resolution
rule:

$$\frac{C_1 \text{ with } p_1 \in C_1 \qquad C_2 \text{ with } p_2 \in C_2}{(C_1 - \{p_1\}) \cup (C_2 - \{p_2\})} \qquad (I)$$

It is possible that there are literals with operators $L$ and $H$ in the clauses of RPL,
and $Lp$ forms a complementary pair of literals with $H{\sim}p$. Hence the following resolution
is valid:
$$\frac{C_1 \text{ with } Lp \in C_1 \qquad C_2 \text{ with } H{\sim}p \in C_2}{(C_1 - \{Lp\}) \cup (C_2 - \{H{\sim}p\})} \qquad (II)$$

The formulas below the line in (I) and (II) are called resolvents obtained
from $C_1$ and $C_2$. Hence the resolution principles in RPLS consist of (I) and
(II). As an example, consider the following resolution deduction:

(1) $L(p \lor q) \lor C_1$    premise

(2) $H{\sim}p \lor C_2$    premise

(3) $L{\sim}q \lor C_3$    premise

(4) $Hq \lor C_1 \lor C_2$    using (1) and (2)

(5) $C_1 \lor C_2 \lor C_3$    using (3) and (4)

Step (4) contains $Hq$ because, for premise (1), $L(p \lor q)$ is not equivalent to $Lp \lor Lq$,
but $L(p \lor q) \land H{\sim}p \to H(q \land \sim p)$ holds. Hence, resolving $L(p \lor q)$
with $H{\sim}p$ gives the resolvent $Hq$.

5 Transformable strategies of the Resolutions in RPLS

Let $C_1$ and $C_2$ be two clauses in RPLS. We can transform them so that we
find the complementary pair of literals in $C_1$ and $C_2$. Therefore, we give
the following transformable strategies:

(1). $T(p, \sim p) = \emptyset$;

(2). $T((C_1 \lor C_2), C_3) = R(C_1, C_3) \lor C_2$;

(3). $T(C_1 \land C_2 \land C_3 \land C_4) = R(C_1, C_3) \land C_2 \land C_4$;

(4). $T(LC_1, LC_2) = L\,R(C_1, C_2)$;

(5). $T(LC_1, HC_2) = H\,R(C_1, C_2)$;

(6). $T(LC_1, C_2) = R(C_1, C_2)$;

(7). $T(C_1 \lor C, C_2 \lor C') = R(C_1, C_2) \lor C \lor C'$;

(8). Substitution: $\emptyset$ for every occurrence of $(\emptyset \land C)$; $C$ for every occurrence of
$(\emptyset \lor C)$; $\emptyset$ for every occurrence of $L\emptyset$ or $H\emptyset$.

Here $R(X, Y)$ denotes that $X$ and $Y$ are resolvable.

6  Soundness  of  Resolution  in  RPLS 

Theorem 1 (soundness theorem) If there is a resolution deduction of a clause
$C$ from a set of clauses $A$, then $A$ logically implies $C$.

Proof The proof is by simple induction on the length of the resolution
deduction. For the induction, we need to show only that any given resolution
step is sound. Suppose $C_1$ and $C_2$ are arbitrary clauses, and resolution of them
produces a new clause $C : (C_1 - \{p_1\}) \cup (C_2 - \{p_2\})$ by (I) or $C' : (C_1 - \{Lp\}) \cup
(C_2 - \{H{\sim}p\})$ by (II). By the induction assumption, $\models_s C_1$ and $\models_s C_2$, that is,
$C_1$ and $C_2$ are true; we prove $\models_s C$ and $\models_s C'$, where $s \in W$, namely that $C$ and
$C'$ are also true.

If $\models_s Lp$, then $\not\models_s H{\sim}p$, because $Lp$ and $H{\sim}p$ are a complementary pair of
literals in RPL, and so $\models_s (C_2 - \{H{\sim}p\})$. If $\models_s H{\sim}p$, then $\not\models_s Lp$, and so
$\models_s (C_1 - \{Lp\})$. But then $\models_s C'$, that is, $\models_s (C_1 - \{Lp\}) \cup (C_2 - \{H{\sim}p\})$.

Similarly, we obtain $\models_s C$, that is, $\models_s (C_1 - \{p_1\}) \cup (C_2 - \{p_2\})$.

If the empty clause can be derived from a given set of clauses by a resolution
deduction, we call that deduction a refutation. For example, given the three
clauses $Lp$, $H(\sim p \lor q)$ and $L \sim q$, the deduction steps are as follows:

(1) $Lp$  premise

(2) $H(\sim p \lor q)$  premise

(3) $L \sim q$  premise

(4) $Hq$  using (1) and (2)

(5) 0  using (3) and (4).

The theorem of soundness and completeness for resolution refutation is
valid, that is, a set of clauses $A$ is unsatisfiable iff $A$ is refutable.


7  Conclusion

We have studied resolution in RPL with the aim of establishing a rough reasoning
system based on the resolution method. The operators L and H in the paper come
from the rough lower and upper approximation operators defined by Lin and Liu in
the references; they differ from the necessity (□) and possibility (◇) operators of
modal logic in their semantic interpretation.
Abstract. We propose to use complex information granules to extract
patterns from data in a distributed environment. These patterns can be
treated as a generalization of association rules.


1  Introduction 

Notions  of  granule  [15],  [9]  and  granule  similarity  (inclusion  or  closeness)  are 
very  natural  in  knowledge  discovery.  The  exact  interpretation  between  granule 
languages  of  different  information  sources  (agents)  often  does  not  exist.  Hence 
closeness  (rough  inclusion)  of  granules  is  considered  instead  of  their  equality. 

For example, the left and right hand sides of an association rule [1] describe
granules, and the support and confidence coefficients specify the degree of inclusion
of the granule represented by the formula on the left hand side in the granule
represented by the formula on the right hand side of the rule.
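The reading of support and confidence as inclusion degrees can be sketched in a few lines of Python. This is only an illustration: the function names and the small transaction table are hypothetical, and the standard inclusion measure card(X ∩ Y) / card(X) is assumed here as the inclusion degree.

```python
def inclusion_degree(x, y):
    """Degree to which granule x is included in granule y (1.0 for an empty x)."""
    x, y = set(x), set(y)
    if not x:
        return 1.0
    return len(x & y) / len(x)


def rule_measures(lhs_granule, rhs_granule, universe):
    """Support and confidence of the rule 'lhs -> rhs' over the given universe."""
    lhs, rhs = set(lhs_granule), set(rhs_granule)
    support = len(lhs & rhs) / len(universe)
    confidence = inclusion_degree(lhs, rhs)
    return support, confidence


# Hypothetical transactions, used only to illustrate the computation.
transactions = {
    1: {"bread", "milk"},
    2: {"bread", "butter"},
    3: {"bread", "milk", "butter"},
    4: {"milk"},
}
# Granules described by the formulas "contains bread" and "contains milk".
lhs = {t for t, items in transactions.items() if "bread" in items}
rhs = {t for t, items in transactions.items() if "milk" in items}
support, confidence = rule_measures(lhs, rhs, transactions)
```

Here the confidence is exactly the inclusion degree of the left-hand granule in the right-hand one, which is the sense in which an association rule compares two granules.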

Reasoning in a distributed environment requires the construction of interfaces
between agents for learning concepts definable by different agents. In this paper
we suggest one solution based on exchanging views of agents on objects with
respect to a given concept. An agent delivering a concept gives positive and
negative examples (objects) with respect to that concept. The agent receiving
this information can describe the objects using its own attributes. In this way a
data table (called a decision table) is created and an approximate description
of the concept can be extracted by the receiving agent.

An analogous method can be used in the case of the customer-agent (the agent
specifying tasks) searching for a top-level cooperating agent (root-agent). The
customer-agent presents examples and counterexamples of objects with
respect to her/his concept. The concept specified by the customer-agent is
approximated by agents, and the agent returning the best approximation of the
customer-agent concept is chosen to be the root-agent. The goal of the cooperating
agents is to produce a concept sufficiently close to (or included in) the concept
specified by the customer-agent. This concept has to be constructed from some
elementary concepts available to agents called inventory or leaf-agents [8]. This is
realized by searching for an agent scheme [8]. The schemes are represented in the
paper by expressions called terms.

We emphasize the approximate (vague) understanding of concepts
received by any agent from other agents. Our solution is based on the rough set
approach. We point out that our approach can be treated as an approach for
extracting generalized association rules in a distributed environment.

2  Rough  Sets  and  Approximation  Spaces 

We recall the general definition of an approximation space [11], [13].

Definition 1. A parameterized approximation space is a system
$AS_{\#,\$} = (U, I_{\#}, \nu_{\$})$, where

- $U$ is a non-empty set of objects,

- $I_{\#} : U \to P(U)$ is an uncertainty function and $P(U)$ denotes the powerset
  of $U$,

- $\nu_{\$} : P(U) \times P(U) \to [0, 1]$ is a rough inclusion function.

The uncertainty function defines for every object $x$ a set of similarly described
objects. A constructive definition of the uncertainty function can be based on the
assumption that some metrics (distances) are given on attribute values. For
example, if for some attribute $a \in A$ there is a metric $\delta_a : V_a \times V_a \to [0, \infty)$,
where $V_a$ is the set of all values of attribute $a$, then one can define the following
uncertainty function:

$$y \in I_a^{f_a}(x) \text{ if and only if } \delta_a(a(x), a(y)) \le f_a(a(x), a(y)),$$

where $f_a : V_a \times V_a \to [0, \infty)$ is a given threshold function.

A set $X \subseteq U$ is definable in $AS_{\#,\$}$ if it is a union of some values of the
uncertainty function.

The  rough  inclusion  function  defines  the  value  of  inclusion  between  two  sub¬ 
sets  of  U  [11],  [9]. 

Now  we  can  define  the  lower  and  the  upper  approximations  of  subsets  of  U. 

Definition 2. For a parameterized approximation space $AS_{\#,\$} = (U, I_{\#}, \nu_{\$})$
and any subset $X \subseteq U$ the lower and the upper approximations are defined by

$$LOW(AS_{\#,\$}, X) = \{x \in U : \nu_{\$}(I_{\#}(x), X) = 1\},$$

$$UPP(AS_{\#,\$}, X) = \{x \in U : \nu_{\$}(I_{\#}(x), X) > 0\}.$$
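Definitions 1 and 2 can be illustrated with a minimal Python sketch. It assumes a single numeric attribute, the metric |u - v|, a constant threshold function, and the standard rough inclusion card(X ∩ Y) / card(X); all names and the temperature readings (in tenths of a degree) are hypothetical.

```python
def uncertainty_function(universe, a, delta, threshold):
    """I(x) = {y : delta(a(x), a(y)) <= threshold(a(x), a(y))} for every x."""
    return {
        x: frozenset(
            y for y in universe if delta(a[x], a[y]) <= threshold(a[x], a[y])
        )
        for x in universe
    }


def standard_inclusion(x, y):
    """Standard rough inclusion: card(x & y) / card(x), and 1 for an empty x."""
    return 1.0 if not x else len(x & y) / len(x)


def low(universe, nbhd, incl, concept):
    """Lower approximation: objects whose neighborhood is fully included."""
    return {x for x in universe if incl(nbhd[x], concept) == 1.0}


def upp(universe, nbhd, incl, concept):
    """Upper approximation: objects whose neighborhood overlaps the concept."""
    return {x for x in universe if incl(nbhd[x], concept) > 0.0}


# Hypothetical readings in tenths of a degree (integers avoid rounding issues).
temperature = {"o1": 366, "o2": 369, "o3": 382, "o4": 390, "o5": 384}
universe = set(temperature)
nbhd = uncertainty_function(
    universe, temperature, lambda u, v: abs(u - v), lambda u, v: 5
)
concept = {"o3", "o4"}
lower = low(universe, nbhd, standard_inclusion, concept)
upper = upp(universe, nbhd, standard_inclusion, concept)
```

Changing the constant 5 plays the role of tuning the parameter vector #: a larger threshold yields larger neighborhoods, hence a smaller lower and a larger upper approximation.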

Approximations of concepts (sets) are constructed on the basis of background
knowledge. Obviously, concepts are also related to objects unseen so far. Hence
it is very useful to define parameterized approximations with parameters tuned
in the process of searching for approximations of concepts. This idea is crucial for
the construction of concept approximations using rough set methods. In our notation
$\#$ and $\$$ denote vectors of parameters which can be tuned in the process of
concept approximation.
The definition of approximation space presented above can be treated as
the semantic part of the approximation space definition. Usually a set of
formulas $\Phi$ expressing properties of objects is also specified. Hence we assume
that together with the approximation space there are given

- a set of formulas $\Phi$ over some language,

- semantics $\|\cdot\|$ of formulas from $\Phi$, i.e., a function from $\Phi$ into the
  power set $P(U)$.

Let us consider an example [7]. We define a language $L_{IS}$ used for elementary
granule description, where $IS = (U, A)$ is an information system. The syntax of
$L_{IS}$ is defined recursively by

1. $(a \in V) \in L_{IS}$, for any $a \in A$ and $V \subseteq V_a$.

2. If $\alpha, \beta \in L_{IS}$, then $\alpha \land \beta \in L_{IS}$.

3. If $\alpha, \beta \in L_{IS}$, then $\alpha \lor \beta \in L_{IS}$.

The semantics of formulas from $L_{IS}$ with respect to an information system
$IS$ is defined recursively by

1. $\|a \in V\|_{IS} = \{x \in U : a(x) \in V\}$.

2. $\|\alpha \land \beta\|_{IS} = \|\alpha\|_{IS} \cap \|\beta\|_{IS}$.

3. $\|\alpha \lor \beta\|_{IS} = \|\alpha\|_{IS} \cup \|\beta\|_{IS}$.
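The recursive syntax and semantics translate directly into a small evaluator. The sketch below is illustrative only: the class names `Sel`, `And`, `Or`, the function `semantics`, and the toy information system are assumptions introduced here, not notation from the paper.

```python
from dataclasses import dataclass
from typing import FrozenSet, Union


@dataclass(frozen=True)
class Sel:
    """Atomic formula (a in V)."""
    attr: str
    values: FrozenSet[str]


@dataclass(frozen=True)
class And:
    left: "Formula"
    right: "Formula"


@dataclass(frozen=True)
class Or:
    left: "Formula"
    right: "Formula"


Formula = Union[Sel, And, Or]


def semantics(formula, table):
    """Return the set of objects of the information system satisfying the formula."""
    if isinstance(formula, Sel):
        return {x for x, row in table.items() if row[formula.attr] in formula.values}
    if isinstance(formula, And):
        return semantics(formula.left, table) & semantics(formula.right, table)
    if isinstance(formula, Or):
        return semantics(formula.left, table) | semantics(formula.right, table)
    raise TypeError(f"not a formula: {formula!r}")


# Hypothetical information system with two attributes.
table = {
    "x1": {"color": "red", "size": "big"},
    "x2": {"color": "red", "size": "small"},
    "x3": {"color": "blue", "size": "big"},
}
red = Sel("color", frozenset({"red"}))
big = Sel("size", frozenset({"big"}))
red_and_big = semantics(And(red, big), table)
red_or_big = semantics(Or(red, big), table)
```

Each formula thus denotes a granule (a subset of U), and conjunction and disjunction correspond to intersection and union of granules.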

A typical method used by the classical rough set approach [7] for a constructive
definition of the uncertainty function is the following: for any object $x \in U$ there
is given information $Inf_A(x)$ (the information vector, or attribute value vector, of $x$)
which can be interpreted as a conjunction of selectors $a = a(x)$ for $a \in A$, and the
set $I_{\#}(x)$ is equal to $\|\bigwedge_{a \in A} a = a(x)\|_{IS}$. One can consider a more general case,
taking as possible values of $I_{\#}(x)$ any set $\|\alpha\|_{IS}$ containing $x$. Next, from the
family of such sets the resulting neighborhood $I_{\#}(x)$ can be selected. One can
also use another approach by considering more general approximation spaces in
which $I_{\#}(x)$ is a family of subsets of $U$ [2], [6].

3  Mutual  Understanding  of  Concepts  by  Agents 

One of the important tasks for Knowledge Discovery and Data Mining (KDD)
[1], [4] in a distributed environment is to develop tools for modeling mutual
understanding of concepts definable by different agents. Mutual understanding
through communication is one of the key issues in enabling collaboration among
agents [5]. We assume agents specify their knowledge using data tables.


3.1  Understanding  of  Concept  Definable  by  Single  Agent 

Let us consider two agents. There are two data tables $IS_1 = (U, A_1)$ and
$IS_2 = (U, \{a\})$ corresponding to the agents. We assume that $a : U \to \{0, 1\}$ is the
characteristic function of a concept $X = \{x \in U : a(x) = 1\}$.


In this situation, typical for the rough set approach, the first agent specifies
the characteristic function of its concept on examples of objects. The second
agent tries to describe the concept using values of its own attributes from
$A_1$ on the objects considered by the first agent. In this way a decision table is
constructed with condition attributes from $A_1$ and the decision $a$. Next, the
lower and the upper approximations of the decision class $X$ are computed. The
size of the boundary region of $X$ with respect to $A_1$ can be used as a measure of
uncertainty in understanding $X$ by the agent with attributes $A_1$.

Closeness of $X$ to its approximations in the language used by the first
agent can be represented by the accuracy of approximation, i.e., by the coefficient

$$\alpha(AS_{A_1}, X) = \frac{card(LOW(AS_{A_1}, X))}{card(UPP(AS_{A_1}, X))}.$$

The approach presented above can be used for learning by one agent of
concepts definable by another agent.

Let us consider again two agents. There are two data tables $IS_1 = (U, A_1)$ and
$IS_2 = (U, A_2)$ corresponding to the agents. We assume that both data tables have
the same set of objects $U$ and that $A_1 = \{a_1^1, \ldots, a_l^1\}$ and $A_2 = \{a_1^2, \ldots, a_k^2\}$ are
two sets of attributes, where $l > 0$ and $k > 0$ are given natural numbers. Let us
consider concepts definable by attributes from the set $A_2$. For example, suppose
that we consider the concept defined by the formula $(a_1^2 = 1 \land a_2^2 = 1) \lor a_3^2 = 1$. This is
a concept definable by the second agent. Hence this agent can compute values of
the characteristic function of the concept on objects from $U$, and the first agent
can find approximations of the concept following the procedure described above.

In  this  way  we  define  approximations  by  the  first  agent  of  concepts  definable 
by  the  second  one. 

Let us mention that the approximation operations are in general not
distributive with respect to disjunction or conjunction. Hence one cannot expect
to construct concept approximations of good quality from approximations of
atomic concepts (e.g. descriptors).
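A small counterexample makes the lack of distributivity concrete. The sketch below assumes the classical partition-based case (neighborhoods are indiscernibility classes); the function names and the four-object universe are hypothetical.

```python
def lower(classes, concept):
    """Union of the indiscernibility classes fully contained in the concept."""
    return {x for block in classes if block <= concept for x in block}


def upper(classes, concept):
    """Union of the indiscernibility classes that intersect the concept."""
    return {x for block in classes if block & concept for x in block}


# Hypothetical partition of a four-object universe.
classes = [frozenset({"a", "b"}), frozenset({"c", "d"})]
x, y = {"a"}, {"b"}

lower_of_union = lower(classes, x | y)                     # approximate the disjunction
union_of_lowers = lower(classes, x) | lower(classes, y)    # combine atomic approximations
upper_of_meet = upper(classes, x & y)                      # approximate the conjunction
meet_of_uppers = upper(classes, x) & upper(classes, y)     # combine atomic approximations
```

Approximating the disjunction directly gives {a, b}, while combining the atomic lower approximations gives the empty set; the dual effect occurs for the upper approximation of the conjunction.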


3.2  Understanding  of  Concept  Definable  by  Team  of  Agents 

Assume a set of agents $Ag = \{ag_1, \ldots, ag_p\}$ is given, where $p > 0$ is a natural
number. Let us consider a data table $IS_{ag} = (U, A_{ag})$ for any agent $ag \in Ag$.
We assume any agent from $Ag$ defines a concept $X$ using the above procedure.
One can construct a decision table $DT$ with condition attributes being
the characteristic functions of the lower and upper approximations of $X$ defined
by all agents from $Ag$ and the decision being the characteristic function of $X$
on given examples of objects. The lower and upper approximations of $X$ with
respect to the condition attributes of $DT$ describe the vagueness in understanding
of $X$ by agents from $Ag$. One can also use other features summarizing the result
of voting by different agents. Examples of such features are the majority voting
feature, which accepts an object as belonging to the concept if the number of voting
agents is greater than a given threshold, or the characteristic function of the
intersection of the upper approximations $\bigcap_{ag \in Ag} UPP(AS_{A_{ag}}, X)$ or of the
intersection of the lower approximations $\bigcap_{ag \in Ag} LOW(AS_{A_{ag}}, X)$. One can observe
that in some cases the above intersections can be undefinable by a single agent.
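The team features described above can be sketched as follows. The sketch assumes partition-based approximation spaces over a common universe and assumes that an agent "votes" for an object when the object lies in that agent's upper approximation of the concept; this voting rule, the function names, and the data are illustrative assumptions.

```python
def lower(partition, concept):
    return {x for block in partition if block <= concept for x in block}


def upper(partition, concept):
    return {x for block in partition if block & concept for x in block}


def definable(partition, subset):
    """A set is definable by an agent if it is a union of that agent's classes."""
    return lower(partition, subset) == subset


def majority_vote(universe, votes, threshold):
    """Objects accepted by more than `threshold` agents."""
    return {x for x in universe if sum(x in v for v in votes) > threshold}


# Hypothetical partitions of the universe {1, 2, 3, 4} used by three agents.
universe = {1, 2, 3, 4}
agents = {
    "ag1": [frozenset({1, 2}), frozenset({3, 4})],
    "ag2": [frozenset({1}), frozenset({2, 3}), frozenset({4})],
    "ag3": [frozenset({1, 2, 3}), frozenset({4})],
}
concept = {2}

uppers = [upper(p, concept) for p in agents.values()]
lowers = [lower(p, concept) for p in agents.values()]
upper_meet = set.intersection(*uppers)
lower_meet = set.intersection(*lowers)
accepted = majority_vote(universe, uppers, threshold=len(agents) / 2)
definable_by_someone = any(definable(p, upper_meet) for p in agents.values())
```

In this example the intersection of the upper approximations is {2}, which none of the three agents can define alone, illustrating the observation that such intersections may be undefinable by a single agent.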

The  described  problem  is  analogous  to  resolving  conflict  between  decision 
rules  voting  for  decision  when  they  are  classifying  new  objects. 

One can extend our approach to the case when, e.g., one agent is trying to
understand concepts definable by a second agent on the basis of the understanding
of these concepts by a third agent. Common knowledge of a given team of agents
about concepts definable by members of this team [3], [14], [10] can also be
considered in this framework.

One can also consider the new features discussed above as the characteristic
functions of concepts definable in some new approximation spaces constructed
from the approximation spaces of agents from $Ag$.


4  Rough  Sets  in  Distributed  Systems 


In this section we consider operations on approximation spaces which seem to be
important for approximate reasoning in distributed systems. We consider a set of
agents $Ag$. Each agent is equipped with some approximation spaces. Agents
cooperate to solve a problem specified by a special agent called the customer-agent.
The result of cooperation is a scheme of agents. In the simplest case the scheme
can be represented by a tree labeled by agents. In this tree leaves deliver
some concepts and any non-leaf agent $ag \in Ag$ performs an operation $o(ag)$
on approximations of concepts delivered by its children. The root agent returns a
concept being the result of computation by the scheme on concepts delivered by
leaf agents. It is important to note that different agents use different languages.
Hence concepts delivered by one agent can be perceived only in an approximate
sense by another agent.

We assume any non-leaf agent $ag$ is equipped with an operation
$o(ag) : U_{ag}^{(1)} \times \ldots \times U_{ag}^{(k)} \to U_{ag}^{(0)}$. Any agent $ag$ together with an operation
$o(ag)$ has different approximation spaces $AS_{ag}^{(1)}, \ldots, AS_{ag}^{(k)}, AS_{ag}^{(0)}$ with universes
$U_{ag}^{(1)}, \ldots, U_{ag}^{(k)}, U_{ag}^{(0)}$, respectively. We assume that the agent $ag$ perceives
objects by measuring values of some available attributes. Hence some objects
can become indiscernible [7]. This influences the specification of any operation
$o(ag)$. We consider the case when arguments and values of operations are
represented by attribute value vectors. Hence instead of the operation $o(ag)$ we have
its inexact specification $o^*(ag)$ taking as arguments $I_{ag}^{(1)}(x_1), \ldots, I_{ag}^{(k)}(x_k)$ for
some $x_1 \in U_{ag}^{(1)}, \ldots, x_k \in U_{ag}^{(k)}$ and returning the value $I_{ag}^{(0)}(o(ag)(x_1, \ldots, x_k))$
if $o(ag)(x_1, \ldots, x_k)$ is defined, otherwise the empty set. This operation can be
extended to the operation $o^*(ag)$ with domain equal to the Cartesian product of
families of definable sets (in approximation spaces attached to arguments) and
with values in the family of definable sets (in the approximation space attached
to the result):

$$o^*(ag)(X_1, \ldots, X_k) = \bigcup_{Y_1 \subseteq X_1, \ldots, Y_k \subseteq X_k} o^*(ag)(Y_1, \ldots, Y_k),$$

where $Y_1, \ldots, Y_k$ are neighborhoods of some objects in definable sets $X_1, \ldots, X_k$,
respectively. In the sequel, for simplicity of notation, we write $o(ag)$ instead of
$o^*(ag)$.

This  idea  can  be  formalized  as  follows.  First  we  define  terms  representing 
schemes  of  agents. 

Let $X_{ag}, Y_{ag}, \ldots$ be agent variables for any leaf-agent $ag \in Ag$. Let $o(ag)$
denote a function of arity $k$. We have mentioned that it is an operation from the
Cartesian product of $Def\_Sets(AS_{ag}^{(1)}), \ldots, Def\_Sets(AS_{ag}^{(k)})$ into $Def\_Sets(AS_{ag}^{(0)})$,
where $Def\_Sets(AS_{ag}^{(i)})$ denotes the family of sets definable in $AS_{ag}^{(i)}$. Using the
above variables and functors we define terms in a standard way, for example
$t = o(ag)(o(ag_1)(X_{ag_1}, Y_{ag_1}), o(ag_2)(\ldots))$. Such terms can be treated
as descriptions of complex information granules. By a valuation we mean any
function $val$ defined on the agent variables with values being definable sets
satisfying $val(X_{ag}) \subseteq U_{ag}$ for any leaf-agent $ag \in Ag$. Now we can define the lower
and the upper values of any term $t$ under the valuation $val$ with respect to a
given approximation space of an agent $ag$.

1. If $t$ is of the form $X_{ag}$, then
   $val(LOW, AS_{ag}^{(0)})(t) = LOW(AS_{ag}^{(0)}, val(t))$ and
   $val(UPP, AS_{ag}^{(0)})(t) = UPP(AS_{ag}^{(0)}, val(t))$ if $val(t) \subseteq U_{ag}$, otherwise
   the lower and the upper values are undefined.

2. If $t = o(ag)(t_1, \ldots, t_k)$, where $t_1, \ldots, t_k$ are terms and $o(ag)$ is an operation
   of arity $k$, then

   $$val(LOW, AS_{ag}^{(0)})(t) = LOW(AS_{ag}^{(0)}, o(ag)(val(LOW, AS_{ag}^{(1)})(t_1), \ldots, val(LOW, AS_{ag}^{(k)})(t_k))),$$

   $$val(UPP, AS_{ag}^{(0)})(t) = UPP(AS_{ag}^{(0)}, o(ag)(val(UPP, AS_{ag}^{(1)})(t_1), \ldots, val(UPP, AS_{ag}^{(k)})(t_k)))$$

   if $val(LOW, AS_{ag}^{(i)})(t_i), val(UPP, AS_{ag}^{(i)})(t_i) \subseteq U_{ag}^{(i)}$ for $i = 1, \ldots, k$,
   otherwise $val(LOW, AS_{ag}^{(0)})(t)$ and $val(UPP, AS_{ag}^{(0)})(t)$ are undefined.

Example 1. We assume $Ag = \{ag, ag_1, ag_2\}$ and $o(ag)$ is a binary operation of
$ag$. Two information systems $IS_{ag_1}$, $IS_{ag_2}$ presented in Tables 1(a), (b) describe
input information granules. We also consider the operation $o(ag)$ described in
Table 3. Two data tables $DT_1 = (U_1, A_1 \cup \{d_1\})$ and $DT_2 = (U_2, A_2 \cup \{d_2\})$ described
in Tables 2(a) and 2(b) characterize interfaces between agents $ag_1$, $ag_2$ and $ag$.
(a)             (b)
      d1              d2
y1    1         z1    1
y2    0         z2    1
y3    1         z3    1
y4    0         z4    0

Table 1. (a) Information System $IS_{ag_1}$ (b) Information System $IS_{ag_2}$
(a)
U1    a_1^1   a_2^1   a_3^1   d1
y1    yes     yes     no      1
y2    no      yes     no      0
y3    no      yes     no      1
y4    no      no      yes     0

(b)
U2    a_1^2   a_2^2   a_3^2   d2
z1    …       …       yes     1
z2    yes     yes     yes     1
z3    no      no      yes     1
z4    no      no      yes     0

Table 2. (a) Data Table $DT_1$ (b) Data Table $DT_2$


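The first four columns of Table 2(a) (2(b)) define the information system $IS_{ag}^{(1)}$
($IS_{ag}^{(2)}$) corresponding to the approximation space $AS_{ag}^{(1)}$ ($AS_{ag}^{(2)}$).

Let $t = o(ag)(X_{ag_1}, X_{ag_2})$ and $val(X_{ag_1}) = \{y_1, y_3\}$. Hence

$$val(LOW, AS_{ag}^{(1)})(X_{ag_1}) = LOW(AS_{ag}^{(1)}, \{y_1, y_3\}) = \{y_1\},$$

$$val(UPP, AS_{ag}^{(1)})(X_{ag_1}) = UPP(AS_{ag}^{(1)}, \{y_1, y_3\}) = \{y_1, y_2, y_3\}.$$

Let $val(X_{ag_2}) = \{z_1, z_2, z_3\}$. Hence

$$val(LOW, AS_{ag}^{(2)})(X_{ag_2}) = LOW(AS_{ag}^{(2)}, \{z_1, z_2, z_3\}) = \{z_1, z_2\},$$

$$val(UPP, AS_{ag}^{(2)})(X_{ag_2}) = UPP(AS_{ag}^{(2)}, \{z_1, z_2, z_3\}) = \{z_1, z_2, z_3, z_4\}.$$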
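The values obtained for $X_{ag_1}$ can be checked mechanically. The sketch below takes the condition attribute values of $y_1, \ldots, y_4$ from Table 2(a), assumes the classical indiscernibility-based approximation space, and also computes the accuracy coefficient defined in Section 3.1; the function names are illustrative.

```python
from collections import defaultdict


def indiscernibility_classes(table):
    """Group objects that share the same vector of condition attribute values."""
    classes = defaultdict(set)
    for obj, row in table.items():
        classes[tuple(row)].add(obj)
    return list(classes.values())


def low(classes, concept):
    return {x for block in classes if block <= concept for x in block}


def upp(classes, concept):
    return {x for block in classes if block & concept for x in block}


# Condition attribute values of DT1 as read from Table 2(a).
dt1 = {
    "y1": ("yes", "yes", "no"),
    "y2": ("no", "yes", "no"),
    "y3": ("no", "yes", "no"),
    "y4": ("no", "no", "yes"),
}
classes = indiscernibility_classes(dt1)
value = {"y1", "y3"}            # val(X_ag1)
lower = low(classes, value)
upper = upp(classes, value)
accuracy = len(lower) / len(upper)
```

Objects $y_2$ and $y_3$ are indiscernible, which is why $y_3$ drops out of the lower approximation while $y_2$ enters the upper one.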

We obtain $o(ag)(\{y_1\}, \{z_1, z_2\}) =
o(ag)(\|a_1^1 = yes \land a_2^1 = yes \land a_3^1 = no\|_{IS_{ag}^{(1)}}, \|a_1^2 = yes\|_{IS_{ag}^{(2)}}) \subseteq \|d = +\|_{IS_{ag}^{(0)}}$.


o(ag)
(y1, z1, w1)
(y1, z3, w2)
(y2, z2, w3)
(y3, z4, w4)
(y4, z1, w5)
(y4, z4, w6)

Table 3. Operation $o(ag)$
The support of the rule if $t$ then $d = +$ under the valuation $val$ with respect to
the lower approximations is equal to
$card(val(LOW, AS_{ag}^{(0)})(t) \cap \|d = +\|_{IS_{ag}^{(0)}}) = 1$ and the confidence is also equal to 1.

We also obtain $o(ag)(\{y_1, y_2, y_3\}, \{z_1, z_2, z_3, z_4\}) =
o(ag)(\|a_2^1 = yes\|_{IS_{ag}^{(1)}}, \|\ldots\|_{IS_{ag}^{(2)}}) = \{w_1, w_2, w_3, w_4\}$.

The support of the rule if $t$ then $d = +$ under the valuation $val$ with respect to
the upper approximations is equal to
$card(val(UPP, AS_{ag}^{(0)})(t) \cap \|d = +\|_{IS_{ag}^{(0)}}) = 3$ and the confidence is equal to 0.75.
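The computation in Example 1 can be sketched as follows. The partial operation is encoded by the triples listed for Table 3. Two simplifying assumptions are made: every object of a definable argument set has its neighborhood inside that set, and output neighborhoods are singletons. The set `positive` is a placeholder: the text fixes only that $w_1$ satisfies $d = +$ and that three of $w_1, \ldots, w_4$ do, so the other two members are hypothetical.

```python
from itertools import product

# Partial operation o(ag): (y, z) -> w, taken from the triples of Table 3.
O_AG = {
    ("y1", "z1"): "w1",
    ("y1", "z3"): "w2",
    ("y2", "z2"): "w3",
    ("y3", "z4"): "w4",
    ("y4", "z1"): "w5",
    ("y4", "z4"): "w6",
}


def lifted(op, *definable_sets, out_nbhd=lambda w: {w}):
    """Extend a partial operation on objects to tuples of definable sets.

    Assumes each object's neighborhood lies inside its definable set;
    out_nbhd maps a result object to its neighborhood (singleton by default).
    """
    result = set()
    for args in product(*definable_sets):
        if args in op:  # undefined argument tuples contribute the empty set
            result |= set(out_nbhd(op[args]))
    return result


def rule_measures(term_value, decision_class):
    """Support (a cardinality, as in the text) and confidence of 'if t then d = +'."""
    support = len(term_value & decision_class)
    return support, support / len(term_value)


lower_value = lifted(O_AG, {"y1"}, {"z1", "z2"})
upper_value = lifted(O_AG, {"y1", "y2", "y3"}, {"z1", "z2", "z3", "z4"})

positive = {"w1", "w2", "w3"}  # hypothetical stand-in for ||d = +||
lower_measures = rule_measures(lower_value, positive)
upper_measures = rule_measures(upper_value, positive)
```

Under these assumptions the sketch reproduces the sets {w1} and {w1, w2, w3, w4} and the reported support and confidence values.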

Let us observe that the set $val(UPP, AS_{ag}^{(0)})(t) - val(LOW, AS_{ag}^{(0)})(t)$ can be
treated as the boundary region of $t$ under $val$. Moreover, in the process of term
construction we have additional parameters to be tuned for obtaining sufficiently
high support and confidence, namely the approximation operations.

A concept $X$ specified by the customer-agent is sufficiently close to $t$ under
a given set $Val$ of valuations if $X$ is included in the upper approximation of $t$
under any $val \in Val$, $X$ includes the lower approximation of $t$ under any
$val \in Val$, and the size of the boundary region of $t$ under $Val$, i.e.,


is sufficiently small relative to $\bigcap_{val \in Val} val(UPP, AS_{ag}^{(0)})(t)$.

We  conclude  by  formulating  some  examples  of  basic  algorithmic  problems. 

- Synthesis of generalized association rules. Searching for a scheme (term $t$)
  over a given set $Ag$ of agents and for a valuation $val$ such that the rule if $t$
  then $\alpha$, where $\alpha$ is a concept description specified by the customer-agent, has
  support at least $s$ and confidence at least $c$ under the valuation $val$.

- Synthesis of complex concepts close to the concept specified by the
  customer-agent. Searching for a scheme (term $t$) over a given set $Ag$ of agents and a
  set $Val$ of valuations such that the concept specified by the customer-agent
  is sufficiently close to $t$ under $Val$ and the total size of the term $t$ and the
  set $Val$ is minimal.


Conclusions 

Our approach can be treated as a step towards understanding complex
information granules in a distributed environment. The approximate understanding
of concepts definable by agents in the language of other agents is an important
aspect of our approach to a calculus on information granules. In our next paper
we will present bounds on the complexity of the problems formulated above as
well as heuristics for solving them.
Abstract. This paper proposes an evolutionary approach for discovering
differences in the usage of words to facilitate collaboration among
people. When people try to communicate their concepts with words,
differences in the meaning and usage of words can lead to misunderstanding
in communication, which can hinder their collaboration. In
our approach each granule of classification knowledge from users is
structured into a decision tree so that a difference in the usage of words can
be discovered as a difference in the structure of the trees. By treating each
granule of classification knowledge as an individual in a Genetic Algorithm
(GA), evolution is carried out with respect to the classification efficiency
of each individual and the diversity of the population, so that differences in
the usage of words emerge as differences in the structure of the decision
trees. Experiments were carried out on motor diagnosis cases with artificially
encoded differences in the usage of words, and the results show the
effectiveness of our evolutionary approach.


Keywords: usage of words, classification, decision tree, evolutionary approach


1  Introduction 

To deal with large-scale and complex problems, it is necessary to support
collaborative work among people through the interaction between people and
machines, since people often form a team to tackle such a
problem. Generally, different people seem to have different ways of conception
and thus can have different concepts even of the same thing. When people try
to communicate their concepts with abstract or vague words, such a difference in
the meaning or usage of words can lead to misunderstanding in communication,
which can hinder their collaboration. Differences in the usage of words can also
be reflected in the description of data in large-scale databases, which are often
constructed with the participation of many people.

This paper proposes an evolutionary approach for discovering differences in
the usage of words to facilitate collaboration among people. Although words
can be utilized to represent meaning at various levels, i.e., abstract or concrete
meaning in general, this paper focuses on dealing with differences in the usage
of symbols for concepts. When users specify their classification knowledge as
diagnosis cases, which are represented with symbols for attributes, values
and classes, the cases are structured into decision trees so that the trees reflect the
conceptual structure of the users' diagnostic knowledge. By treating each granule of
classification knowledge as an individual in a Genetic Algorithm (GA), this paper
proposes to carry out evolution with respect to the classification efficiency of
each individual and the diversity of the population, so that differences in the usage
of words emerge as differences in the structure of the decision trees.

2  Framework  of  Discovering  Difference  in  the  Usage  of 
Words 

2.1  Difference  in  the  Usage  of  Words 

This paper focuses on dealing with differences in the usage of words which
represent conceptual meaning at the symbol level. Hereafter we call such a
difference a "conceptual difference". Usually different symbols are used to denote
different concepts; however, the same symbol can be used to denote different
concepts depending on the viewpoint from which the symbol is used. Conversely,
different symbols can be used to represent the same concept. The types of
conceptual difference dealt with in this paper are defined as follows:

- Type 1: different symbols are used to denote the same concept.

- Type 2: the same symbol is used to denote different concepts.

Suppose there are an expert in electrical engineering called Adam and an expert
in mechanical engineering called Bob. When they carry out the diagnosis of a
motor failure, Adam might point out "anomaly in voltage frequency" and Bob
might point out "anomaly in revolutions" for the same symptom. These two
symbols or terms represent different concepts in general; however, they can be
considered as denoting the same concept when they are used in the context of
the diagnosis of motor failure.


2.2  Discovering  Conceptual  Difference 

The kind of problem which can be dealt with in our approach is classification,
such as diagnosis, where the class of a case is determined based on the attributes
and values which characterize the case [3]. The system tries to construct the
decision tree which is most effective for classifying the cases according to
information theory. A node in a decision tree holds an attribute characterizing the
cases. Each link below a node holds a value of the attribute at the node, and
cases are divided gradually by following the links. The class of a case is determined
at the leaf which is reached as the result of following the links. We utilize the ID3
algorithm [5] to construct decision trees since it is fast and thus suitable for
interactive systems.
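The attribute selection step of ID3 can be sketched as follows. This is a minimal illustration of the well-known information gain criterion, not the system's implementation; the attribute names and the four diagnosis cases are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(cases, attribute, target):
    """Reduction of class entropy obtained by splitting the cases on `attribute`."""
    base = entropy([c[target] for c in cases])
    remainder = 0.0
    for value in {c[attribute] for c in cases}:
        subset = [c[target] for c in cases if c[attribute] == value]
        remainder += len(subset) / len(cases) * entropy(subset)
    return base - remainder


def best_attribute(cases, attributes, target):
    """The attribute ID3 would place at the current node."""
    return max(attributes, key=lambda a: information_gain(cases, a, target))


# Hypothetical motor diagnosis cases.
cases = [
    {"noise": "high", "heat": "yes", "cls": "bearing"},
    {"noise": "high", "heat": "no", "cls": "bearing"},
    {"noise": "low", "heat": "yes", "cls": "winding"},
    {"noise": "low", "heat": "no", "cls": "winding"},
]
root = best_attribute(cases, ["noise", "heat"], "cls")
gain_noise = information_gain(cases, "noise", "cls")
gain_heat = information_gain(cases, "heat", "cls")
```

Here `heat` has zero gain and would never appear in the tree ID3 builds, so no structural comparison of trees could reveal a difference in how users name or interpret it.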

The system architecture which incorporates the discovery method of this
paper is shown in Fig. 1. Currently the system requires that two users represent
their knowledge as cases using their respective symbols. Accepting the
cases as input, the system constructs decision trees for them and tries to detect
conceptual differences in attributes, values and classes based on the structural
differences between the trees. Since there are 2 types of conceptual difference for
3 entities, the system tries to discover 6 kinds of conceptual differences and shows
the candidates to the users in descending order of likelihood. The
system also displays the decision trees for the cases. Visualizing their knowledge as
concrete decision trees is expected to help users modify their knowledge and to
stimulate further conception. Based on the result from the system, users discuss
with each other how to change their concepts so as to reduce conceptual differences,
and modify the input data to the system. The above processes are repeated
interactively to remove conceptual differences gradually. In future we plan to extend
the system so that it is applicable to more than two users.
Fig. 1. System Architecture


2.3  Discovery  Algorithms  for  Conceptual  Difference 

The algorithms are designed based on the role of the symbols in the classification
of cases, which is discovered from the structural characteristics of decision trees.
This section briefly explains the key idea for discovering conceptual differences in
our approach. The details of the algorithms are described in [4, 2, 7]. Hereafter
the sets of cases from the two users and the set of synthesized cases are called A, B
and A+B, respectively.

Conceptual difference for class symbols is defined as:

C1 : different symbols are used to denote the same class
C2 : the same symbol is used to denote different classes

These are discovered based on the "inconsistency in the classification knowledge"
for cases in A+B. C1 is discovered when different symbols are used as the class
symbol for cases with the same value for each attribute. C2 is discovered
when the same symbol is used as the class symbol for cases with different
values and/or attributes.
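The two conditions can be sketched in Python under a deliberately simplified reading: cases are compared directly by their attribute-value descriptions rather than through decision trees, as the algorithms in the cited references do. The data layout, function names, and example cases are all hypothetical.

```python
def description(case):
    """Hashable attribute-value description of a case."""
    return frozenset(case["attrs"].items())


def c1_candidates(cases_a, cases_b):
    """Pairs of different class symbols attached to identically described cases."""
    found = set()
    for a in cases_a:
        for b in cases_b:
            if description(a) == description(b) and a["cls"] != b["cls"]:
                found.add((a["cls"], b["cls"]))
    return found


def c2_candidates(cases_a, cases_b):
    """Class symbols shared by both users but never attached to the same description."""
    shared = {a["cls"] for a in cases_a} & {b["cls"] for b in cases_b}
    found = set()
    for symbol in shared:
        desc_a = {description(c) for c in cases_a if c["cls"] == symbol}
        desc_b = {description(c) for c in cases_b if c["cls"] == symbol}
        if desc_a.isdisjoint(desc_b):
            found.add(symbol)
    return found


# Hypothetical cases from two users.
cases_a = [
    {"attrs": {"revolutions": "unstable"}, "cls": "overload"},
    {"attrs": {"revolutions": "stable"}, "cls": "normal"},
]
cases_b = [
    {"attrs": {"revolutions": "unstable"}, "cls": "power failure"},
    {"attrs": {"revolutions": "low"}, "cls": "normal"},
]
c1 = c1_candidates(cases_a, cases_b)
c2 = c2_candidates(cases_a, cases_b)
```

The sketch only produces candidates; in the system described here such candidates are presented to the users, who decide through discussion whether a real conceptual difference exists.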

The  algorithms  for 

A1  :  different  symbols  are  used  to  denote  the  same  attribute 
A2  :  the  same  symbol  is  used  to  denote  the  different  attributes 
V1 : different symbols are used to denote the same value
V2  :  the  same  symbol  is  used  to  denote  the  different  values 

are  defined  similarly  based  on  the  structure  of  decision  tree  [7]. 

3  Discovery  Method  for  Conceptual  Differences  based  on 
Diverse  Structures 

3.1  Problems in Utilizing ID3

In general there can be several decision trees giving an equivalent classification
of a set of cases; however, the ID3 algorithm constructs the simplest tree in the
sense that classification is carried out as early as possible at the upper nodes of
the decision tree by reducing the redundant information held in the cases. Since
the time complexity of the ID3 algorithm is low, it is suitable for interactive
systems. However, reducing the redundant information held in a set of cases is
not always advantageous for the aim of our approach, namely, discovering
conceptual differences among people. With the ID3 algorithm, attributes which do
not contribute to classifying cases are not represented in the decision trees, even
if they are utilized in the representation of the cases to denote the knowledge
held by users. Since such an attribute is not represented in the decision tree,
it is impossible to discover a conceptual difference for the attribute or its values
by comparing the decision trees. This implies that it is sometimes
impossible to discover a difference in concept as a difference in structure
when the structure is constructed by the ID3 algorithm.


3.2  Constructing  Decision  Trees  with  Diverse  Structures 

A Genetic Algorithm (GA) is utilized in our approach to construct decision trees
with diverse structures, so that decision trees effective for discovery can be sought
while preserving the accuracy of the classification of cases. Hereafter, in decision trees
the node at which the class of cases is determined is called a “leaf node”, and
the node at which an attribute of a case is tested is called a “condition node”.

Coding, Crossover and Mutation We employ GA with tree structure, which
is often used in Genetic Programming, for constructing decision trees [6, 1]. A
node in a decision tree is represented as a gene in the coding of genetic information.
The gene for a condition node contains the information for the position in
a decision tree and the attribute to judge the branch in the decision tree. The
gene for a branch contains the value for the attribute above the branch, which
indicates the branch to follow in the decision tree.

Crossover is carried out by exchanging partial trees, as shown in Figure 2. A
partial tree is defined as a tree which has a condition node as its root. Mutation
is carried out by randomly selecting several children which are constructed by
crossover and copying them. Arbitrary nodes in these copied trees are specified as
crossover points and the partial trees below the nodes are replaced with either
another partial tree or a leaf node.

Fig. 2. Crossover by exchanging partial trees
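The exchange of partial trees can be sketched as follows. This is an illustrative sketch only, not the authors' C implementation: the `Node` class, the helper names, and the uniform random choice of crossover points are assumptions, and both parents are assumed to have a condition node at the root.

```python
import copy
import random

class Node:
    """Condition node if `attribute` is set (children keyed by value), otherwise a leaf."""
    def __init__(self, attribute=None, children=None):
        self.attribute = attribute
        self.children = children or {}

def condition_slots(holder):
    """Return (container, key) slots that currently hold a condition node."""
    slots, stack = [], [(holder, "root")]
    while stack:
        container, key = stack.pop()
        node = container[key]
        if node.attribute is not None:
            slots.append((container, key))
            stack.extend((node.children, value) for value in node.children)
    return slots

def crossover(tree1, tree2, rng=random):
    """Copy both parents and exchange one randomly chosen partial tree between the copies."""
    h1, h2 = {"root": copy.deepcopy(tree1)}, {"root": copy.deepcopy(tree2)}
    c1, k1 = rng.choice(condition_slots(h1))
    c2, k2 = rng.choice(condition_slots(h2))
    c1[k1], c2[k2] = c2[k2], c1[k1]
    return h1["root"], h2["root"]

def attributes_of(tree):
    """List the attributes tested in a tree (pre-order)."""
    if tree.attribute is None:
        return []
    found = [tree.attribute]
    for child in tree.children.values():
        found.extend(attributes_of(child))
    return found

parent1 = Node("a", {"x": Node("b", {"u": Node(), "v": Node()}), "y": Node()})
parent2 = Node("c", {"p": Node(), "q": Node()})
child1, child2 = crossover(parent1, parent2, random.Random(0))
```

Because the parents are deep-copied first, they remain available unchanged for the survival selection described next.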


Survival Strategy It is desirable to leave for the next generation a set of
decision trees, each of which has high accuracy for the classification of cases and
whose structures are as different as possible. The following two criteria are utilized
as the survival strategy in GA to select decision trees for the next generation.

(1)  Error  Rate 

“Error  rate”  is  defined  as  the  index  which 

—  decreases  when  cases  are  classified  into  one  class  at  shallower  depth 

—  increases  when  cases  are  not  classified  into  one  class  even  at  deeper  depth 

The  decision  tree  with  small  error  rate  has  better  capability  for  classification. 

Error rate is calculated as shown in Fig. 3. First, cases are classified by a
decision tree and stored into the leaf nodes of the tree at which the class of a
case is determined. The error rate for a leaf node is determined depending on the
capability of the classification of cases along the path from the root to the leaf
node and is assigned as:

- “0” when a leaf node has only cases with the same class.

- “n_c - 1” when a leaf node has cases with multiple classes (the number of classes is n_c).

- “N_c” (the number of all classes) when a leaf node has no case.

The above calculation of error rate treats the leaf node with no case as the worst
one and penalizes a leaf node more severely when it has multiple classes.

Fig. 3. Calculation of error rate


As for a condition node, all the error rates assigned to its children are summed
and then divided by the number of branches, which is the number of values for
the attribute at the condition node. Taking the average of the error rates below a
condition node alleviates the difference in error rates which would
arise from the difference in the number of values for each attribute. The average
is multiplied by “d - 1” (d is the depth of the condition node in the tree) and
assigned as the error rate for the condition node. The depth of the root of a tree is
treated as 1. Multiplying the average by “d - 1” reflects the efficiency of
classification.

The equations for calculating error rate are summarized as follows:

- Error rate at leaf nodes

  number of classes = 1: $0$
  number of classes = $n_c$ ($\ge 2$): $n_c - 1$
  number of classes = 0: $N_c$ ($> n_c$) (the number of all classes)

- Equations for error rate at condition nodes

  $E = (d - 1) \times \left(\sum_{v} e_v\right) / n_v \quad (d \neq 1)$

  $E = \left(\sum_{v} e_v\right) / n_v \quad (d = 1)$

d: depth of node, $n_v$: number of values, $e_v$: error rate for each child node
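The error-rate rules above can be sketched recursively. The `Leaf` and `Condition` classes, the case format, and the attribute names are assumptions made for illustration; the sum is taken over the branches of each condition node.

```python
class Leaf:
    """Node at which the class of a case is determined."""

class Condition:
    """Node that tests one attribute; `branches` maps each value to a child node."""
    def __init__(self, attribute, branches):
        self.attribute = attribute
        self.branches = branches

def error_rate(node, cases, n_all_classes, depth=1):
    """Error rate of `node` for `cases`, a list of (attribute_dict, class_symbol) pairs."""
    if isinstance(node, Leaf):
        classes = {class_symbol for _, class_symbol in cases}
        if not classes:               # no case reached this leaf: worst value N_c
            return n_all_classes
        return len(classes) - 1       # 0 for a pure leaf, n_c - 1 otherwise
    child_rates = []
    for value, child in node.branches.items():
        subset = [c for c in cases if c[0].get(node.attribute) == value]
        child_rates.append(error_rate(child, subset, n_all_classes, depth + 1))
    average = sum(child_rates) / len(child_rates)
    # The root (depth 1) uses the plain average; deeper nodes are scaled by d - 1.
    return average if depth == 1 else (depth - 1) * average

cases = [({"a": "x", "b": "u"}, "P"),
         ({"a": "x", "b": "u"}, "Q"),
         ({"a": "y", "b": "u"}, "P")]
tree = Condition("a", {"x": Condition("b", {"u": Leaf(), "v": Leaf()}),
                       "y": Leaf()})
rate = error_rate(tree, cases, n_all_classes=2)
```

In this example the inner condition node has one mixed leaf (rate 1) and one empty leaf (rate 2), giving (2 - 1) x 1.5 = 1.5, and the root averages 1.5 and 0.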

(2)  Mutual  Distance  between  Decision  Trees 

“Mutual  distance  between  decision  trees”  is  utilized  to  measure  the  degree 
of  divergence  in  structure  for  a  set  of  decision  trees.  It  is  calculated  by  summing 
up  the  distance  for  each  pair  of  decision  trees  in  the  set.  A  set  of  decision  trees 
with  larger  mutual  distance  has  more  divergence  in  structure  and  thus  has  the 
possibility  of  including  the  attribute  or  value  which  cannot  be  represented  in  a 
single  decision  tree. 

An example of the calculation of distance between decision trees is shown in
Figure 4. Suppose the number of all the decision trees is $N_t$ and the number of
decision trees which are left for the next generation is $n_t$. First, a set of decision
trees is constructed by picking up $n_t$ decision trees from all the decision trees.
Then, two decision trees are selected from the set. After that, vectors of size $n_a$
(which is the number of attributes in these trees) are constructed for each tree.
Vectors are prepared for each depth in the trees.


Fig. 4. Calculation of the distance between decision trees.

The value of an element in a vector is set to “1” when the attribute for the
element exists at a condition node with that depth; otherwise it is set to “0”.
Then, the vectors with the same depth are compared and the number of elements
with different values is counted. Since the number of condition nodes generally
grows with the depth, the attributes for condition nodes with
small depth are considered significant. Thus, the result of counting is multiplied
by a weight which is in inverse proportion to the depth of the vector to reflect the
degree of significance of each attribute. The above operations are carried out for
all the $_{n_t}C_2$ pairs of decision trees and the result is treated
as the mutual distance for a set of decision trees.

Equation for the mutual distance of decision trees

$\mathrm{Dist.} = \sum^{{}_{n_t}C_2} \sum_{d=1}^{D} \sum_{a=1}^{n_a} \left\{ \alpha^{d-1} \, \left| vec1_{d,a} - vec2_{d,a} \right| \right\}$

$n_t$: #decision trees in one generation, $n_a$: #attributes in decision trees
$\alpha$ (< 1): weight for the depth of node, $D$ (= $n_a$): maximum depth of condition node
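The distance computation can be sketched as below. Two assumptions are made for illustration: each tree is summarized as a mapping from depth to the set of attributes tested at that depth, and the depth weight is taken as alpha ** (depth - 1), so that depth 1 has weight 1 and deeper levels count less.

```python
from itertools import combinations

def tree_distance(tree1, tree2, attributes, alpha=0.5):
    """Weighted count of attributes present at a depth in one tree but not in the other.

    Each tree is given as {depth: set of attributes tested at that depth}.
    """
    max_depth = max(list(tree1) + list(tree2))
    distance = 0.0
    for depth in range(1, max_depth + 1):
        at1 = tree1.get(depth, set())
        at2 = tree2.get(depth, set())
        differing = sum((a in at1) != (a in at2) for a in attributes)
        distance += alpha ** (depth - 1) * differing   # assumed form of the depth weight
    return distance

def mutual_distance(trees, attributes, alpha=0.5):
    """Sum of the pairwise distances over all pairs of trees in the set."""
    return sum(tree_distance(t1, t2, attributes, alpha)
               for t1, t2 in combinations(trees, 2))

attributes = ["a", "b", "c", "d"]
t1 = {1: {"a"}, 2: {"b"}}
t2 = {1: {"c"}, 2: {"b", "d"}}
t3 = {1: {"a"}, 2: {"b"}}
total = mutual_distance([t1, t2, t3], attributes)
```

A set of trees that repeats the same structure (t1 and t3 here) contributes nothing for that pair, so selection by mutual distance favors structurally different trees.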


4  Experiments  and  Evaluations 

A prototype system has been implemented on a UNIX workstation in the C language.
Experiments on motor diagnosis cases were carried out to evaluate
the approach in this paper. In the experiments two persons specified their knowledge
in the form of 100 cases (as shown in Fig ??), which were composed of six attributes,
two or three values and five classes, respectively. Conceptual differences
were artificially encoded into the cases in B by modifying the original cases.

Experiments were carried out under the condition that two kinds of conceptual
difference occurred at the same time in the test cases, to see the interaction and/or
interference between the algorithms. As the quantitative evaluation, the number
of discoveries and the probability of discovery up to the third candidate were
collected in the experiments both for the system with ID3 and that with GA.
As described in Section 2.2, conceptual difference is resolved in our approach by
repeating the interaction between the suggestion by the system and the modification
of cases by the user; however, the experiments focused on the result for the first
cycle of suggestion by the system. A summary of the results of discovery
is shown in Table 1 and Table 2. The result shows that the system with ID3 can


Table 1. Result with ID3.

difference  number of trials  1st  2nd  3rd  probability of discovery
C1          20                 20    0    0  100%
C2                             18    0    0   90%
C1          30                 17    1    0   60%
A2                              5    3    7   50%
C1          30                 30    0    0  100%
V2                             52    0    0   87%
C2          30                 22    0    1   77%
V1                             24    4    0   93%
A1          30                 12   13    3   93%
A2                              6    3    7   53%
A1          30                 12   11    7  100%
V2                             45    2    6   88%

Table 2. Result with GA.

difference  number of trials  1st  2nd  3rd  probability of discovery
C1          20                 20    0    0  100%
C2                             18    0    0   90%
C1          30                 30    0    0  100%
A2                             19    6    2   90%
C1          30                 30    0    0  100%
V2                             52    8    0  100%
C2          30                 23    1    0   80%
V1                             30    0    0  100%
A1          30                 14   10    4   93%
A2                             20    8    2  100%
A1          30                 20    9    1  100%
V2                             38   22    0  100%

accurately discover conceptual difference for C1 and A1. However, it cannot
discover other kinds of conceptual difference with high accuracy; for instance,
the probability of discovery remains at 50% for A2. It is also noticed that conceptual
difference is suggested anywhere from the first to the third candidate. On the other hand, the
system with GA can discover conceptual difference more accurately in general,
and conceptual difference is suggested at a higher rank among the candidates. These
results show that the structures which are suitable for our discovery algorithms
are not necessarily represented in the decision trees with ID3. Thus, the diverse
structures with GA can be said to contribute to improving the performance of
discovery of conceptual difference. Suggesting conceptual difference as the first
candidate will also contribute to reducing the possibility of suggesting conceptual
difference erroneously. Moreover, utilizing the average of discovery over multiple
decision trees might make the system with GA more robust against noise due to the
statistical effect of averaging.

The experiments show that utilizing diverse structures with GA is superior to
utilizing ID3 for the construction of decision trees with respect to the precision
of discovery of conceptual difference. On the other hand, with respect to
computational complexity, ID3 takes much less time than GA and thus is suitable
for the interactive system.

5  Conclusion 

This  paper  has  proposed  an  evolutionary  approach  for  discovering  difference  in 
the  usage  of  words  to  facilitate  collaboration  among  people.  In  our  approach 
knowledge  of  users  is  structured  into  decision  trees  and  candidates  for  concep¬ 
tual  difference  are  suggested  based  on  the  structural  characteristics  of  decision 
trees. By pointing out the problem in utilizing a deterministic approach for the
construction of decision trees, this paper has proposed to carry out evolution
with respect to the classification efficiency of each decision tree and the diversity
of the population. Experiments were carried out on motor diagnosis cases with
artificially  encoded  conceptual  difference.  The  result  shows  that  our  approach  is 
effective  to  some  extent  as  the  first  step  for  dealing  with  the  issue  of  conceptual 
difference  toward  facilitating  collaboration  among  people. 
Abstract. In AHP, there exists the problem of pair-wise consistency
where evaluations by pair-wise comparison are presented with crisp values.
We propose the interval AHP model with interval data reflecting the Rough
Set concept. The proposed models are formulated for analyzing interval
data with two concepts (necessity and possibility). According to the necessity
and possibility concepts, we obtain upper and lower evaluation models,
respectively. Furthermore, even if crisp data in AHP are given, it is illustrated
that crisp data should be transformed into interval data by
using the transitive law. Numerical examples are shown to illustrate the
interval AHP models reflecting the uncertainty of evaluations in nature.
Keywords: AHP, Evaluation, Rough sets concept, Upper and lower
models, Intervals


1  Introduction 


AHP (Analytic Hierarchy Process) proposed by T. L. Saaty [1] has been used to
evaluate alternatives in multiple criteria decision problems under a hierarchical
structure and has frequently been applied to actual decision problems. Saaty's
AHP method is based on comparing n objects in pairs according to their relative
weights. Let us denote the objects by $X_1, \ldots, X_n$ and their weights by
$w_1, \ldots, w_n$. The pair-wise comparisons can be denoted as the following matrix:


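$$A = \begin{array}{c|cccc}
 & X_1 & X_2 & \cdots & X_n \\ \hline
X_1 & w_1/w_1 & w_1/w_2 & \cdots & w_1/w_n \\
X_2 & w_2/w_1 & w_2/w_2 & \cdots & w_2/w_n \\
\vdots & \vdots & \vdots & & \vdots \\
X_n & w_n/w_1 & w_n/w_2 & \cdots & w_n/w_n
\end{array}$$

which satisfies the reciprocal property $a_{ji} = 1/a_{ij}$. If the matrix A satisfies the
cardinal consistency property $a_{ij} a_{jk} = a_{ik}$, A is called consistent. Generally, A
is called a reciprocal matrix.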

According to Saaty's method, we have

$Aw = \lambda w$

where a weight vector w can be obtained by solving the above eigenvalue problem.
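As a small illustration of the eigenvalue problem, the principal eigenvector can be approximated by power iteration. This is a generic numerical sketch, not a procedure taken from the paper; the function name and the example weights are assumptions.

```python
def principal_weights(A, iterations=100):
    """Approximate the principal eigenvector of A by power iteration, scaled to sum to 1."""
    n = len(A)
    w = [1.0 / n] * n
    for _ in range(iterations):
        v = [sum(A[i][j] * w[j] for j in range(n)) for i in range(n)]
        total = sum(v)
        w = [x / total for x in v]
    return w

# A consistent matrix built from known weights: a_ij = w_i / w_j.
true_w = [0.5, 0.3, 0.2]
A = [[wi / wj for wj in true_w] for wi in true_w]
w = principal_weights(A)
```

For a consistent matrix the recovered vector coincides with the weights used to build it, which is the situation the consistency property describes.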

Now suppose that the pair-wise comparison ratios are given by intervals,
although they are real numbers in the conventional AHP. Interval scales are
estimated by an individual as approximations. In AHP, the ratio scale for pair-wise
comparisons ranges from 1 to 9 to represent judgment entries, where 1 is
equally important and 9 is absolutely more important. It should be noted that
the reciprocal values $a_{ji} = 1/a_{ij}$ are always satisfied. As an example of interval
ratios, we can give an interval $A_{ij} = [3, 5]$, and then $A_{ji}$ must be $[1/5, 1/3]$. AHP
with fuzzy scales has been studied by C. H. Cheng and D. L. Mon [2], where fuzzy
scales are transformed into ordinal scales. Considering fuzziness of scales, sensitivity
analysis for AHP has been done in [3].

In this paper, we propose an interval AHP model, given interval scales as pair-wise
comparison ratios. Dealing with interval data, we can obtain the upper and
lower models for AHP, which are similar to Rough Sets Analysis [4]. Even when
crisp data are given, interval data can be obtained from crisp data by using the
transitive law. Thus, our proposed method can be described as reflecting the
uncertainty of evaluations in nature. Our method resorts to linear programming
so that interval scales can easily be handled. This approach to uncertain phenomena
has been used in regression analysis and also in the identification of possibility
distributions [5]. Numerical examples are shown to illustrate the interval AHP
models.
the problem under consideration is to find out interval weights $W_i = [\underline{w}_i, \overline{w}_i]$
which can be an approximation to the given interval matrix [A] of (3) in some
sense. Since we deal with interval data, we can consider two approximations
shown in Fig. 1 as follows. The lower and upper approximations should satisfy


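the following constraint conditions

$W^L_{ij} \subseteq [A_{ij}]$ (Lower Approximation)   (4)

$W^U_{ij} \supseteq [A_{ij}]$ (Upper Approximation)   (5)

where $W^L_{ij}$ and $W^U_{ij}$ are the estimations of lower and upper intervals. (4) and (5)
can be rewritten as

$W^L_{ij} \subseteq [A_{ij}] \iff \underline{a}_{ij} \le \dfrac{\underline{w}_i}{\overline{w}_j} \le \dfrac{\overline{w}_i}{\underline{w}_j} \le \overline{a}_{ij} \iff \underline{a}_{ij}\,\overline{w}_j \le \underline{w}_i,\ \ \overline{a}_{ij}\,\underline{w}_j \ge \overline{w}_i$   (6)

$W^U_{ij} \supseteq [A_{ij}] \iff \dfrac{\underline{w}_i}{\overline{w}_j} \le \underline{a}_{ij} \le \overline{a}_{ij} \le \dfrac{\overline{w}_i}{\underline{w}_j} \iff \overline{a}_{ij}\,\underline{w}_j \le \overline{w}_i,\ \ \underline{a}_{ij}\,\overline{w}_j \ge \underline{w}_i$   (7)

Fig. 1. Upper and lower approximations ($W^L \subseteq [A] \subseteq W^U$)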
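Conditions (6) and (7) compare the ratio interval of two interval weights with a given interval scale. A minimal sketch of both checks is given below, with intervals written as (lower, upper) pairs of positive numbers; the function names and the sample values are assumptions.

```python
from fractions import Fraction as F

def ratio_interval(wi, wj):
    """Interval W_i / W_j = [lower_i / upper_j, upper_i / lower_j] for positive intervals."""
    return wi[0] / wj[1], wi[1] / wj[0]

def satisfies_lower(a, wi, wj):
    """Condition (6): the ratio interval lies inside the interval scale a."""
    return a[0] * wj[1] <= wi[0] and a[1] * wj[0] >= wi[1]

def satisfies_upper(a, wi, wj):
    """Condition (7): the ratio interval covers the interval scale a."""
    return a[1] * wj[0] <= wi[1] and a[0] * wj[1] >= wi[0]

w_i = (F(2, 5), F(1, 2))
w_j = (F(1, 5), F(1, 4))
ratio = ratio_interval(w_i, w_j)   # (8/5, 5/2)
```

With these values the ratio interval [8/5, 5/2] lies inside [1, 3] and covers [2, 9/4], so the same weights act as a lower approximation of the first scale and an upper approximation of the second.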

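Now let us consider how to normalize an interval vector $(W_1, \ldots, W_n)$, although
in the conventional AHP a weight vector is normalized so that its components
sum to unity. The conventional normalization can be extended into the
interval normalization [6] defined as follows:

An interval weight vector $(W_1, \ldots, W_n)$ is said to be normalized if and only if

$\sum_i \overline{w}_i - \max_j (\overline{w}_j - \underline{w}_j) \ge 1$   (8)

$\sum_i \underline{w}_i + \max_j (\overline{w}_j - \underline{w}_j) \le 1$   (9)

(8) and (9) can be described as

$\forall j \quad \underline{w}_j \ge 1 - \sum_{i \in \Omega - \{j\}} \overline{w}_i$   (10)

$\forall j \quad \overline{w}_j \le 1 - \sum_{i \in \Omega - \{j\}} \underline{w}_i$   (11)

where $\Omega = \{1, \ldots, n\}$. Using the concepts of “Greatest Lower Bound” and “Least
Upper Bound”, we can formulate the lower model and the upper model, respectively.
The concept of two approximations is similar to the Rough Set concept.

< Lower Model >

Max $\sum_i (\overline{w}_i - \underline{w}_i)$   (12)

subject to

$\forall i, j\ (i \neq j) \quad \underline{a}_{ij}\,\overline{w}_j \le \underline{w}_i$

$\forall i, j\ (i \neq j) \quad \overline{a}_{ij}\,\underline{w}_j \ge \overline{w}_i$

$\forall j \quad \underline{w}_j \ge 1 - \sum_{i \in \Omega - \{j\}} \overline{w}_i$

$\forall j \quad \overline{w}_j \le 1 - \sum_{i \in \Omega - \{j\}} \underline{w}_i$

$\forall i \quad \underline{w}_i \le \overline{w}_i$

$\forall i \quad \underline{w}_i, \overline{w}_i \ge 0$

< Upper Model >

Min $\sum_i (\overline{w}_i - \underline{w}_i)$   (13)

subject to

$\forall i, j\ (i \neq j) \quad \overline{a}_{ij}\,\underline{w}_j \le \overline{w}_i$

$\forall i, j\ (i \neq j) \quad \overline{w}_j\,\underline{a}_{ij} \ge \underline{w}_i$

$\forall j \quad \underline{w}_j \ge 1 - \sum_{i \in \Omega - \{j\}} \overline{w}_i$

$\forall j \quad \overline{w}_j \le 1 - \sum_{i \in \Omega - \{j\}} \underline{w}_i$

$\forall i \quad \underline{w}_i \le \overline{w}_i$

$\forall i \quad \underline{w}_i, \overline{w}_i \ge 0$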

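Example 1:

The pair-wise comparisons matrix is given as:

$$[A] = \begin{pmatrix}
1 & [1,3] & [3,5] & [5,7] & [5,9] \\
[1/3,1] & 1 & [1/2,2] & [1,5] & [1,4] \\
[1/5,1/3] & [1/2,2] & 1 & [1/3,3] & [2,4] \\
[1/7,1/5] & [1/5,1] & [1/3,3] & 1 & [1,3] \\
[1/9,1/5] & [1/4,1] & [1/4,1/2] & [1/3,1] & 1
\end{pmatrix} \qquad (14)$$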
Using the lower and the upper models, we obtained the interval weights shown
in Table 1.

Table 1 Interval weights obtained by two models (Example 1)

Alternatives  Lower model        Upper model
W1            [0.4225, 0.5343]   [0.3333, 0.3750]
W2            [0.1781, 0.2817]   [0.1250, 0.3333]
W3            [0.1408, 0.1408]   [0.0417, 0.2500]
W4            [0.0763, 0.0845]   [0.0536, 0.1250]
W5            [0.0704, 0.0704]   [0.0417, 0.1250]

It can be found from Table 1 that the interval weights obtained by the lower
model satisfy (6) and the interval weights obtained by the upper model satisfy
(7). The obtained interval weights can be said to be normalized, because (8)
and (9) hold.
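The normalization conditions (8) and (9) can be checked numerically on the Table 1 weights. The sketch below is illustrative only; the function name is an assumption, and the tolerance of 1e-3 is an assumption introduced because the tabulated weights are rounded to four decimals.

```python
def is_normalized(intervals, tol=1e-3):
    """Check interval normalization (8) and (9) for a list of (lower, upper) weights."""
    lowers = [lo for lo, _ in intervals]
    uppers = [up for _, up in intervals]
    max_width = max(up - lo for lo, up in intervals)
    cond8 = sum(uppers) - max_width >= 1 - tol   # condition (8)
    cond9 = sum(lowers) + max_width <= 1 + tol   # condition (9)
    return cond8 and cond9

lower_model = [(0.4225, 0.5343), (0.1781, 0.2817), (0.1408, 0.1408),
               (0.0763, 0.0845), (0.0704, 0.0704)]
upper_model = [(0.3333, 0.3750), (0.1250, 0.3333), (0.0417, 0.2500),
               (0.0536, 0.1250), (0.0417, 0.1250)]
```

Both weight vectors pass the check within the rounding tolerance, whereas a vector such as [(0.1, 0.2), (0.1, 0.2)] does not.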


3  Interval  scales  by  transitivity 

If a consistent matrix A is given, the following consistency property holds:

$a_{ij} = a_{ik} a_{kj}$   (15)

However, this property does not hold in general. Therefore, interval scales can
be obtained by transitivity from crisp scales. Denote an interval scale $A_{ij}$ as
$A_{ij} = [\underline{a}_{ij}, \overline{a}_{ij}]$ and an interval matrix [A] as $[A] = [A_{ij}]$. Given a crisp matrix
A, the interval matrix [A] can be obtained as follows:

$\underline{a}_{ij} = \min_V (a_{ik} \cdots a_{lj})$   (16)

$\overline{a}_{ij} = \max_V (a_{ik} \cdots a_{lj})$   (17)

where V is the set of all possible chains from i to j without any loops.
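Equations (16) and (17) amount to enumerating every loop-free chain from i to j and taking the smallest and largest product of the scales along it. A minimal sketch follows; the function names, the use of `None` for unknown scales, and the 0-based indices are assumptions, and the sample matrix is the win/defeat matrix of the binary problem in Example 3.

```python
from fractions import Fraction as F

def chain_products(A, i, j):
    """Yield the product along every loop-free chain i -> ... -> j whose scales are all known."""
    n = len(A)

    def walk(node, visited, product):
        if node == j:
            yield product
            return
        for nxt in range(n):
            if nxt in visited or A[node][nxt] is None:
                continue
            yield from walk(nxt, visited | {nxt}, product * A[node][nxt])

    yield from walk(i, {i}, F(1))

def interval_by_transitivity(A, i, j):
    """Return (min, max) of the chain products, following (16) and (17)."""
    products = list(chain_products(A, i, j))
    return min(products), max(products)

# 2 = win, 1/2 = defeat, None = unknown scale (objects 1..4 stored at indices 0..3).
A = [[F(1),    F(2), F(2),    F(1, 2)],
     [F(1, 2), F(1), None,    None],
     [F(1, 2), None, F(1),    F(2)],
     [F(2),    None, F(1, 2), F(1)]]
a_23 = interval_by_transitivity(A, 1, 2)
a_32 = interval_by_transitivity(A, 2, 1)
```

The enumeration is exponential in the number of objects, which is acceptable for the small matrices used here; if no chain exists between two objects, `min` raises a `ValueError`.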


Example2: 

Let  us  start  with  a  crisp  scale  matrix  as  follows. 


Using  (16)  and  (17),  we  obtained  the  interval  matrix  [A]  . 


[1.5]  [i,6]  [1,6]  [|,18] 
1  [|.¥][|.¥]  [|.6] 
[^,|]  1  [|.¥]  [|.6] 
[^.f][^.f]  1  [|.3] 


(18) 


(19) 
We applied the interval matrix (19) to the lower and the upper models and
obtained the interval weights shown in Table 2.

Table 2 Interval weights obtained by two models (Example 2)

Alternatives  Lower model        Upper model
W1            [0.2972, 0.5868]   [0.2967, 0.4723]
W2            [0.1528, 0.2547]   [0.0945, 0.2543]
W3            [0.0991, 0.2547]   [0.0787, 0.2543]
W4            [0.1189, 0.1274]   [0.0787, 0.1272]
W5            [0.0425, 0.0660]   [0.0262, 0.0675]

Example 3:

Let us consider the binary problem shown in Fig. 2, where i → j means that
i won against j. It is assumed that the value of 2 is assigned to wins and the
value of 1/2 is assigned to defeats. Then we obtain the matrix A with unknown
scales denoted as *:

Fig. 2. Binary Problem

$$A = \begin{pmatrix}
1 & 2 & 2 & 1/2 \\
1/2 & 1 & * & * \\
1/2 & * & 1 & 2 \\
2 & * & 1/2 & 1
\end{pmatrix} \qquad (20)$$

Using transitivity (16) and (17), we have

$$[A] = \begin{pmatrix}
1 & 2 & [1/4,2] & [1/2,4] \\
1/2 & 1 & [1/8,1] & [1/4,2] \\
[1/2,4] & [1,8] & 1 & [1/4,2] \\
[1/4,2] & [1/2,4] & [1/2,4] & 1
\end{pmatrix} \qquad (21)$$

We applied the interval matrix (21) to the lower and upper models and obtained
the interval weights shown in Table 3.

Table 3 Interval weights obtained by two models (Example 3)

Alternatives  Lower model        Upper model
W1            [0.2500, 0.2500]   [0.0909, 0.3636]
W2            [0.1250, 0.1250]   [0.0455, 0.1818]
W3            [0.1250, 0.3723]   [0.0909, 0.3636]
W4            [0.2527, 0.5000]   [0.0909, 0.3636]

4  Concluding  Remarks 

In the conventional AHP, pair-wise comparisons range from 1 to 9 as a ratio
scale. Therefore the scales range from 1/9 to 9. If we use transitivity (16) and (17),
the upper and lower bounds of the interval scales obtained by (16) and (17) may
not be within the maximal interval [1/9, 9]. Thus, instead of (16) and (17), we can
use

$\underline{a}_{ij} = \min_V f(a_{ik} \cdots a_{lj})$   (22)

$\overline{a}_{ij} = \max_V f(a_{ik} \cdots a_{lj})$   (23)

where the function $f(x)$ is defined by

$$f(x) = \begin{cases}
1/9 & \text{for } x \text{ which is less than } 1/9 \\
x & \text{for } x \text{ which is within } [1/9, 9] \\
9 & \text{for } x \text{ which is larger than } 9
\end{cases} \qquad (24)$$

Instead of the function $f$, the geometric mean can be used to obtain interval
scales.
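The function f of (24) is a simple clipping to the interval [1/9, 9]. A minimal sketch, with a hypothetical function name:

```python
from fractions import Fraction as F

def clip_to_scale(x, low=F(1, 9), high=F(9)):
    """Function f of (24): values outside [1/9, 9] are moved to the nearest bound."""
    return max(low, min(high, x))

products = [F(1, 18), F(3), F(18)]
clipped = [clip_to_scale(p) for p in products]
```

Applied to the chain products before taking the minimum and maximum, this keeps every interval scale inside the range of the 1-to-9 ratio scale.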
Abstract. Interval density functions are non-additive probability measures
representing sets of probability density functions. Pawlak proposed
a novel approach called conflict analysis based on rough set theory. In
this paper, we propose a new approach of presenting experts' knowledge
with interval importances and apply it to conflict analysis. It is assumed
that the importance degrees are given for representing an expert's knowledge.
Using conditions of interval density functions, we represent many
experts' knowledge as interval importance degrees. A simple example of
the newly introduced concepts is presented.

Keywords: Interval density functions; Decision analysis; Rough sets; Conflict
analysis


1  Introduction 

Interval  density  functions  (IDF)[1]  are  non-additive  probability  measures  repre¬ 
senting  sets  of  probability  density  functions.  An  interval  density  function  consists 
of  two  density  functions  by  extending  values  of  conventional  density  function  to 
interval  values,  which  do  not  satisfy  additivity. 

Conflict analysis plays an important role in many real fields such as business,
labor-management negotiations, military operations, etc. Mathematical
models of conflict situations have been proposed [2][3] and investigated.
Conflicts are one of the most characteristic attributes of human nature, and a
study of conflicts is important theoretically and practically. It seems that fuzzy
sets and rough sets [4] are suitable candidates for modeling conflict situations
under the presence of uncertainty.

In  this  paper,  we  propose  a  new  approach  of  presenting  expert’s  knowledge 
with  interval  importances  and  apply  it  to  conflict  analysis.  It  is  assumed  that  an 
expert’s  knowledge  is  given  as  a  relative  importance  for  each  attribute.  When 
there  are  plural  experts,  their  knowledge  is  formulated  as  an  interval  importance 
using  interval  density  functions.  Then,  a  conflict  degree  between  two  agents  has 
an  interval  value. 
2 Interval Density Functions

In this section, we introduce the concept of interval density functions [1].
Probability distributions have one-to-one correspondence with their density functions.
A probability density function $d : X \to \mathbb{R}$ on the disjoint finite universe X is
defined as:

$\forall x \in X,\ d(x) \ge 0,\quad \sum_{x \in X} d(x) = 1.$

Then the probability of the event A is given as:

$\forall A \subseteq X,\quad P(A) = \sum_{x \in A} d(x).$

For all $A, B \subseteq X$, the additivity holds as follows:

$A \cap B = \emptyset \Rightarrow P(A \cup B) = P(A) + P(B).$

Interval density functions, being non-additive probability measures, are defined
as follows:

Definition 1 (An interval density function on the disjoint finite universe): A
pair of functions $(h_*, h^*)$ satisfying the following conditions is called an interval
density function (IDF):

$h_*, h^* : X \to \mathbb{R};\quad \forall x' \in X,\ h^*(x') \ge h_*(x') \ge 0,$

(I) $\sum_{x \in X} h_*(x) + (h^*(x') - h_*(x')) \le 1,$

(II) $\sum_{x \in X} h^*(x) - (h^*(x') - h_*(x')) \ge 1.$

The conditions (I) and (II) can be transformed as:

(I') $\sum_{x \in X} h_*(x) + \max_{x}\,[h^*(x) - h_*(x)] \le 1,$

(II') $\sum_{x \in X} h^*(x) - \max_{x}\,[h^*(x) - h_*(x)] \ge 1.$

Then, we have the following theorem.

Theorem 1 For any IDF, there exists a probability density function $h'(\cdot)$ satisfying
that

$h_*(x) \le h'(x) \le h^*(x),\quad \sum_{x \in X} h'(x) = 1.$

To illustrate an interval density function let us consider the case shown in
Fig. 1, where the number 6 is more likely to occur than the numbers
1 to 5. The interval density functions for the numbers 1 to 5 are $(h_*, h^*) =
(1/10, 1/6)$, and the interval density function for the number 6 is $(h_*, h^*) =
(1/6, 1/2)$.

Fig. 1. Example for interval density functions

It is clear that these interval density functions satisfy Definition 1. Taking
the number 6 for $x'$,

$\sum_{x \in X} h_*(x) + (h^*(x') - h_*(x')) = \frac{2}{3} + \frac{1}{3} \le 1$

$\sum_{x \in X} h^*(x) - (h^*(x') - h_*(x')) = \frac{4}{3} - \frac{1}{3} \ge 1$
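The check of Definition 1 for this example can be reproduced with exact fractions. The sketch below tests the equivalent conditions (I') and (II'); the function name and the dictionary representation are assumptions.

```python
from fractions import Fraction as F

def is_idf(lower, upper):
    """Check conditions (I') and (II') for dicts mapping each x to h_*(x) and h^*(x)."""
    if any(not (upper[x] >= lower[x] >= 0) for x in lower):
        return False
    max_width = max(upper[x] - lower[x] for x in lower)
    cond1 = sum(lower.values()) + max_width <= 1   # (I')
    cond2 = sum(upper.values()) - max_width >= 1   # (II')
    return cond1 and cond2

# The example above: numbers 1 to 5 have (1/10, 1/6), number 6 has (1/6, 1/2).
lower = {x: F(1, 10) for x in range(1, 6)}
lower[6] = F(1, 6)
upper = {x: F(1, 6) for x in range(1, 6)}
upper[6] = F(1, 2)
result = is_idf(lower, upper)
```

Here the lower values sum to 2/3, the upper values sum to 4/3, and the largest width is 1/3 (at the number 6), so both conditions hold with equality.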


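Using these functions $(h_*, h^*)$, we can define two distribution functions as
follows:

(Lower boundary function (LB) and upper boundary function (UB) of IDF): For
$h'$ satisfying $h_*(x) \le h'(x) \le h^*(x)$, $\forall A \subseteq X$,

$LB(A) = \min_{h'} \left( \sum_{x \in A} h'(x) \right),$

$UB(A) = \max_{h'} \left( \sum_{x \in A} h'(x) \right).$

Then, lower and upper boundary functions have the following properties.
$\forall A \subseteq X$,

$LB(A) = \sum_{x \in A} h_*(x) \vee \left( 1 - \sum_{x \notin A} h^*(x) \right),$

$UB(A) = \sum_{x \in A} h^*(x) \wedge \left( 1 - \sum_{x \notin A} h_*(x) \right).$

And, the duality of LB and UB holds.

$1 - UB(\bar{A}) = 1 - \left[ \sum_{x \notin A} h^*(x) \wedge \left( 1 - \sum_{x \in A} h_*(x) \right) \right]$

$\qquad = \sum_{x \in A} h_*(x) \vee \left( 1 - \sum_{x \notin A} h^*(x) \right) = LB(A)$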
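The closed forms of LB and UB and their duality can be evaluated on the dice-like example of Section 2 (numbers 1 to 5 with (1/10, 1/6), number 6 with (1/6, 1/2)). This is a minimal sketch; the function names and the set and dictionary representation are assumptions.

```python
from fractions import Fraction as F

def lb(event, lower, upper):
    """LB(A) = sum of h_* over A, joined (max) with 1 - sum of h^* outside A."""
    outside = [x for x in lower if x not in event]
    return max(sum(lower[x] for x in event), 1 - sum(upper[x] for x in outside))

def ub(event, lower, upper):
    """UB(A) = sum of h^* over A, met (min) with 1 - sum of h_* outside A."""
    outside = [x for x in lower if x not in event]
    return min(sum(upper[x] for x in event), 1 - sum(lower[x] for x in outside))

lower = {x: F(1, 10) for x in range(1, 6)}
lower[6] = F(1, 6)
upper = {x: F(1, 6) for x in range(1, 6)}
upper[6] = F(1, 2)

six = {6}
not_six = {1, 2, 3, 4, 5}
bounds_six = (lb(six, lower, upper), ub(six, lower, upper))
```

For the event {6} the bounds coincide with the interval density (1/6, 1/2), and 1 - UB of the complement returns LB of the event, as the duality states.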

Importance  degrees  from  experts  will  be  formulated  as  interval  importance 
degrees  using  interval  density  functions  in  Section  4. 

3  Conflict  analysis 

In this section, we outline conflict analysis following Pawlak [3]. In a
conflict, at least two parties, called agents, are in dispute over some issues. The
relationship of each agent to a specific issue can be clearly represented in the
form of a table, as shown in Table 1. This table is taken from [3].


U    a    b    c    d    e
1   -1    1    1    1    1
2    1    0   -1   -1   -1
3    1   -1   -1   -1    0
4    0   -1   -1    0   -1
5    1   -1   -1   -1   -1
6    0    1   -1    0    1

Table 1. Example of information system


Table 1 is called an information system in rough set theory [4]. The
rows of an information system are labelled by objects, the columns are
labelled by attributes, and the entries of the table are values of attributes, which
are uniquely assigned to each object and each attribute. Then, the information
system S is given as (U, Q, V), where U is the set of objects, Q is the set of
attributes and V is the set of attribute values. In conflict analysis, a conflict
situation is represented as a form of restricted information system. Then, objects
correspond to agents and attributes correspond to issues. So, in Table 1,
U = {1, ..., 6} is a set of agents and Q = {a, ..., e} is a set of issues. The values
of attributes represent the attitude of agents to issues: 1 means that the
agent is favorable to the issue, -1 means that the agent is against the issue and
0 means neutral.

In order to express the relation between agents, the following auxiliary function
from [3] is defined as

$$\phi_q(x, y) = \begin{cases}
1 & \text{if } q(x)q(y) = 1 \text{ or } x = y \\
0 & \text{if } q(x)q(y) = 0 \text{ and } x \neq y \\
-1 & \text{if } q(x)q(y) = -1
\end{cases} \qquad (1)$$

where $q(x)$ is the attitude of the agent x to the issue q. This means that, if
$\phi_q(x, y) = 1$, the agents x and y have the same opinion about the issue q; if
$\phi_q(x, y) = 0$, at least one agent has a neutral approach to q; and if $\phi_q(x, y) = -1$,
they have different opinions about q.

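We need the distance between x and y to evaluate the relation between x
and y. Therefore we use Pawlak's definition as follows:

$\rho_Q(x, y) = \dfrac{\sum_{q \in Q} \phi^*_q(x, y)}{|Q|}$   (2)

where

$$\phi^*_q(x, y) = \frac{1 - \phi_q(x, y)}{2} = \begin{cases}
0 & \text{if } q(x)q(y) = 1 \text{ or } x = y \\
0.5 & \text{if } q(x)q(y) = 0 \text{ and } x \neq y \\
1 & \text{if } q(x)q(y) = -1
\end{cases} \qquad (3)$$

Applying $\rho_Q(x, y)$ to the data in Table 1, we obtained Table 2.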


U   1     2     3     4     5     6
1
2   0.9
3   0.9   0.2
4   0.8   0.3   0.3
5   1     0.1   0.1   0.2
6   0.4   0.5   0.5   0.6   0.6

Table 2. Distance functions between objects in Table 1


4  Interval  importances  to  conflict  analysis 

In this section, we will add subjective evaluations for issues to conflict analysis.
It is assumed that non-negative relative weights are given for all issues. Using a
non-negative weight $w(q)$ for each issue q, a new distance function $\rho_Q$ is defined
as follows:

$$
\rho_Q(x,y) = \frac{\sum_{q \in Q} \phi^*_q(x,y)\, w(q)}{\sum_{q \in Q} w(q)}
\tag{4}
$$

where $\sum_{q \in Q} w(q) \neq 0$. Let $w'(q) = w(q) / \sum_{q \in Q} w(q)$. We can rewrite
$\rho_Q$ under the normality condition, $\sum_{q \in Q} w'(q) = 1$, as follows:

$$
\rho_Q(x,y) = \sum_{q \in Q} \phi^*_q(x,y)\, w'(q)
\tag{5}
$$

When an expert’s knowledge is given as the following weights, the distance
function with weights is calculated as in Table 3 using (4):

w(a) = 0.2; w(b) = 0.8; w(c) = 0.5; w(d) = 1.0; w(e) = 0.6.


U   1      2      3      4      5      6
2   0.87
3   0.90   0.23
4   0.81   0.32   0.29
5   1.00   0.13   0.10   0.19
6   0.35   0.52   0.55   0.65   0.65

Table 3. Distance functions with weights
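The weighted distance of equation (4) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: Table 1 is not reproduced in this section, so the attitude data below are hypothetical, and the names `ATTITUDES`, `phi`, `phi_star` and `distance` are chosen here, not taken from the paper. With no weights supplied, the function falls back to equal weights, which reduces (4) to Pawlak's distance (2).

```python
# Hypothetical attitudes (NOT the paper's Table 1): agent -> {issue: -1, 0 or 1}.
ATTITUDES = {
    1: {"a": -1, "b": 1, "c": 1},
    2: {"a": 1, "b": 0, "c": -1},
    3: {"a": 1, "b": 1, "c": -1},
}


def phi(qx, qy, same_agent=False):
    """Auxiliary function (1): 1 = same opinion, 0 = a neutral agent, -1 = opposed."""
    if same_agent or qx * qy == 1:
        return 1
    if qx * qy == 0:
        return 0
    return -1


def phi_star(qx, qy, same_agent=False):
    """Equation (3): rescales phi from {1, 0, -1} to {0, 0.5, 1}."""
    return (1 - phi(qx, qy, same_agent)) / 2


def distance(x, y, attitudes, weights=None):
    """Equation (4); with weights=None every issue gets weight 1, i.e. equation (2)."""
    issues = list(attitudes[x])
    if weights is None:
        weights = {q: 1.0 for q in issues}
    total = sum(weights[q] for q in issues)
    if total == 0:
        raise ValueError("the weights must not sum to zero")
    weighted = sum(
        phi_star(attitudes[x][q], attitudes[y][q], x == y) * weights[q]
        for q in issues
    )
    return weighted / total


if __name__ == "__main__":
    example_weights = {"a": 0.2, "b": 0.8, "c": 0.5}  # hypothetical expert weights
    print(distance(1, 2, ATTITUDES))
    print(distance(1, 2, ATTITUDES, example_weights))
```

Dividing by the weight total inside `distance` means the caller does not have to normalize the weights first, which mirrors the step from (4) to (5).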


When the knowledge of many experts is available, it is formulated as an interval
density function [1], as shown in Section 2. It is assumed that each expert gives
normal weights, that is, weights whose sum is 1. When there are several experts,
the following pair of functions $(w_*, w^*)$ becomes an interval density function.

Proposition: When there exist several sets of normal weights, $w_i$ (i = 1, ..., n), over
the disjoint space Q, two functions are defined as

$$
w_*(q) = \min_{i \in \{1,\ldots,n\}} w_i(q)
$$

$$
w^*(q) = \max_{i \in \{1,\ldots,n\}} w_i(q)
$$

Then, $(w_*, w^*)$ becomes an interval density function.

Proof: It is clear that

$$
\sum_{q \in Q} w_*(q) \le 1 \le \sum_{q \in Q} w^*(q)
$$

holds. If there exists some $q' \in Q$ such that
$\sum_{q \in Q} w_*(q) + (w^*(q') - w_*(q')) > 1$,
then there is no set of normal weights to which $w^*(q')$ belongs. Therefore, for
all $q' \in Q$,

$$
\sum_{q \in Q} w_*(q) + (w^*(q') - w_*(q')) \le 1
$$

holds. Similarly, for all $q' \in Q$,

$$
\sum_{q \in Q} w^*(q) - (w^*(q') - w_*(q')) \ge 1
$$

holds. Consequently, $(w_*, w^*)$ becomes an interval density function. Q.E.D.
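Using the functions $(w_*, w^*)$ in place of the single weight function in $\rho_Q$, we
can write the distance functions

$$
\rho_{Q*}(x,y) = \frac{\sum_{q \in Q} \phi^*_q(x,y)\, w_*(q)}{\sum_{q \in Q} w_*(q)}
\tag{6}
$$

and

$$
\rho^*_Q(x,y) = \frac{\sum_{q \in Q} \phi^*_q(x,y)\, w^*(q)}{\sum_{q \in Q} w^*(q)}
\tag{7}
$$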

When the knowledge of many experts is given as in Table 4, the distance
function with weights is calculated as in Table 5 using (6) and (7).


Q  importance 
a  [0.15,  0.30] 
b  [0.20,  0.35] 
c  [0.10,  0.25] 
d  [0.25,  0.40] 
e  [0.10,  0.20] 


Table  4.  Experts’  knowledge  with  interval  density  functions 
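The construction in the Proposition and the distances (6) and (7) can be checked numerically. The sketch below is illustrative only: the function names are chosen here, the two-expert example is hypothetical, and the three conditions tested in `is_interval_density` are the inequalities that appear in the proof above (the full definition is in Section 2, which is not reproduced here). The bounds in `TABLE_4_LOWER` and `TABLE_4_UPPER` are copied from Table 4.

```python
TABLE_4_LOWER = {"a": 0.15, "b": 0.20, "c": 0.10, "d": 0.25, "e": 0.10}
TABLE_4_UPPER = {"a": 0.30, "b": 0.35, "c": 0.25, "d": 0.40, "e": 0.20}


def interval_weights(expert_weights):
    """Return (w_*, w^*): the per-issue minimum and maximum over all experts."""
    issues = list(expert_weights[0])
    lower = {q: min(w[q] for w in expert_weights) for q in issues}
    upper = {q: max(w[q] for w in expert_weights) for q in issues}
    return lower, upper


def is_interval_density(lower, upper, tol=1e-9):
    """Check the three inequalities used in the proof of the Proposition."""
    sum_lower = sum(lower.values())
    sum_upper = sum(upper.values())
    if sum_lower > 1 + tol or sum_upper < 1 - tol:
        return False
    for q in lower:
        width = upper[q] - lower[q]
        if width < -tol:
            return False
        if sum_lower + width > 1 + tol:
            return False
        if sum_upper - width < 1 - tol:
            return False
    return True


def interval_distance(phi_star_values, lower, upper):
    """Equations (6) and (7) for one pair of agents.

    phi_star_values maps each issue q to phi*_q(x, y). The two results are
    returned in the order (6), (7); they are not sorted.
    """
    rho_lower = sum(phi_star_values[q] * lower[q] for q in lower) / sum(lower.values())
    rho_upper = sum(phi_star_values[q] * upper[q] for q in upper) / sum(upper.values())
    return rho_lower, rho_upper


if __name__ == "__main__":
    # Two hypothetical experts, each giving normal weights (sum = 1).
    experts = [{"a": 0.5, "b": 0.5}, {"a": 0.3, "b": 0.7}]
    lo, up = interval_weights(experts)
    print(lo, up, is_interval_density(lo, up))
    print(is_interval_density(TABLE_4_LOWER, TABLE_4_UPPER))
```

Note that (6) and (7) normalize by different totals, so the value from (6) is not guaranteed to be the smaller of the two.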


5  Conclusion 

A new approach to conflict analysis with interval importance representing
experts’ knowledge is proposed under the assumption that an expert’s knowledge
is given as a relative importance for each attribute. Importance degrees from
experts are formulated as interval importance degrees using interval density
functions, and then conflict degrees between two agents are obtained as interval
values.

The presented approach to conflict analysis depends on experts’ knowledge,
which leads to interval-valued conflict degrees. In order to judge the relationship
between two agents as one of conflict, neutrality, or alliance, the judgement
measure proposed by Pawlak can be used.

U   1                2                3                4                5                6
1
2   [0.875, 0.886]
3   [0.933, 0.938]   [0.187, 0.188]
4   [0.750, 0.753]   [0.353, 0.375]   [0.300, 0.313]
5   [1.000, 1.000]   [0.120, 0.125]   [0.063, 0.067]   [0.233, 0.250]
6   [0.375, 0.400]   [0.487, 0.500]   [0.533, 0.563]   [0.600, 0.625]   [0.600, 0.625]

Table 5. Distance functions by interval density functions


Abstract. As end-user computing grows, the volume of information
defined by users is increasing. Therefore, incorporating the information
defined by users is a core component of knowledge management. In this
paper, the author proposes a method for incorporating personal databases,
which is based on granular computing and relational database theory.


1  Introduction 

As end-user computing grows, the volume of information defined by users
is increasing. Using databases defined by users is very convenient for our daily
work. At the same time, personal databases are also an important source of
knowledge for an organization. It is necessary to incorporate personal databases
in order to use them as primary knowledge sources.

A possible way to incorporate databases is relation transformation
based on the normalization theory of relational databases [1]. However,
normalization theory focuses only on the formal aspect of relational databases. To
incorporate the personal databases defined by users, a method that reflects the
meaning of a domain is required.

Data mining based on granular computing is essentially a “reverse” engineering
of database processing. The latter organizes and stores data according to the
given semantics, while the former is “discovering” the semantics from stored
data [5]. This assertion suggests that data mining based on granular computing
is an efficient way of incorporating personal databases.

In this paper, the author proposes a method for incorporating personal
data resources. This method is based on granular computing and relational
database theory. First, anomalies in personal databases are discussed, and
then the proposed incorporation method and its theoretical background are
described.

2  Properties  of  Personal  Data 

As the first step of our study, we conducted a survey of personal databases. As
a result of this survey, the following deviations were found in the databases
defined by users.




- Deviation in data definition.

In the Japanese language environment, there are many ways to write
words with the same meaning. However, differently written words with the
same meaning are processed as different words.

- Deviation in database schema definition.

• Deviation in attribute definition.

For example, when defining a field for a customer’s name, one user may
use two attributes, the family name and the first name, while another
user may define only one attribute, the name.

• Functional dependencies.

Functional dependencies define a relation between key values and dependent
values. There may be various key definitions on the same relation. Even
if the same relation definition is provided to all users, each user probably
defines different keys on the relation.

• Relation.

If a relation in a database satisfies only the first normal form of
normalization theory, an ordinary SQL operation sometimes gives
abnormal results. These abnormal results are caused by a relation
structure in which several different relations are intermixed. Usually, a
translation into the second normal form is considered when a database
application developer meets these abnormal results.

In relational database theory, normalization theory has become the core
approach to database normalization [1]. Many personal databases do not satisfy
all levels of the relational database normal forms. From a practical viewpoint, it
is sufficient if every personal database satisfies the third normal form of
relational database theory.

Formal methods for translating a first normal form relation into a second
normal form relation have been developed. Most methods translate relations
based on their syntactic aspects, but the main task in translating a first
normal form relation involves semantic issues. When we translate a first normal
form relation, we should pay attention to its semantic aspects. So we need to
develop a semantics-based method for translating a first normal form relation
into a second normal form relation.


3  Schema  Discovery  by  Granular  Computing 

3.1  Functional  Dependencies  and  Normalization 

In this section, we give a brief overview of functional dependencies and
relational database normalization theory, following the literature [2].

Let sets X and Y be two attribute sets of a relation $R(A_1, A_2, \ldots, A_n)$, where
$X \cup Y = \Omega_R$ and $X \cap Y = Z$, and where $\Omega_R$ is the set of all attributes on a
relation R. The functional dependency $X \to Y$ is defined in Definition 1.




Definition 1. We say that the functional dependency from X to Y in R exists if
condition (1) is satisfied in all instances r of R:

$$
(\forall t, t' \in R)\,(t[X] = t'[X] \Rightarrow t[Y] = t'[Y])
\tag{1}
$$

The multi-valued dependency $X \twoheadrightarrow Y$ is a generalization of the functional
dependency and is defined in Definition 2.

Definition 2. X and Y are attribute sets on a relation R. We say X decides Y in
multi-value, or Y depends on X in multi-value, when condition (2) is satisfied
in all instances r of R:

$$
(\forall t, t' \in R)\,\bigl(t[X] = t'[X] \Rightarrow (t[X \cup Y], t'[Z]) \in R
\land (t'[X \cup Y], t[Z]) \in R\bigr)
\tag{2}
$$

where $Z = \Omega_R - (X \cup Y)$.

The definition says that the new tuples $(t[X \cup Y], t'[Z])$ and $(t'[X \cup Y], t[Z])$
that are made from tuples t and t' are also tuples in R, where t and t' satisfy
the condition $t[X] = t'[X]$.

We say two projections $R[X]$ and $R[Y]$ of the relation R are an information
lossless decomposition if $R = R[X] \bowtie R[Y]$. The necessary and sufficient
condition for information lossless decomposition is guaranteed by the following
propositions and theorem.

Proposition 3. If X and Y are an information lossless decomposition of R,
$R \subseteq R[X] \bowtie R[Y]$.

Proposition 4. The necessary and sufficient condition for $R \supseteq R[X] \bowtie R[Y]$ is
that the multi-valued dependency $X \cap Y \twoheadrightarrow X \mid Y$ is satisfied on R.

By Propositions 3 and 4, the following theorem can be obtained.

Theorem 5. The necessary and sufficient condition for $R = R[X] \bowtie R[Y]$ is
that the multi-valued dependency $X \cap Y \twoheadrightarrow X \mid Y$ is satisfied on R.


3.2  Granular  Computing 

We introduce the notation and theorems of granular computing, following
Lin [5]. An equivalence relation divides the universe into disjoint elementary sets.
A binary relation decomposes a universe into elementary neighborhoods that are
not necessarily disjoint. The decomposition is called a binary granulation and
the collection of the granules a binary neighborhood system [3, 4].

Let $(U; B^i; C^i, i = 0, 1, 2, \ldots)$ be a collection of granular structures, where U is
a set of entities or an NS-space imposed by B, $B^i$ is the set of elementary
neighborhoods, $C^i$ is the set of elementary concepts (= NAME($B_j$)), and each $C^i$ is
an NS-space. The relationships among attribute values of a table are defined by
the elementary sets of a multiple partition. Inclusion of an elementary set in
another elementary set is an inference on the corresponding elementary concepts.
A functional dependency of two columns is a refinement of two corresponding
partitions.

On the granular structure, the following rules are defined [5].

1. Continuous inference rules:

A formula $C_j \to D_h$ is a continuous inference rule if $NEIGH(P_j) \subseteq
NEIGH(Q_h)$. Here, $NEIGH(P_j)$ means a neighborhood of $P_j$.

2. Softly robust continuous inference rules: A formula $C_j \to D_h$ is a softly
continuous inference rule if $NEIGH(P_j) \subseteq NEIGH(Q_h)$ and
$|NEIGH(P_j) \cap NEIGH(Q_h)| >$ threshold.

3. High level continuous inference rules: Suppose $PN = B^i$, $QN = B^j$, and
$j \neq i$ are two nested granular structures, that is, $PN^i \prec PN^{(i+k)}$ and
$QN^i \prec QN^{(i+k)}$. Write $P = PN^m$ and $Q = QN^n$, where $n \geq m$ and $k > 0$.
A formula $C_j \to D_h$ is a high level continuous inference rule if
$NEIGH(P_j) \subseteq NEIGH(Q_h)$ and $|NEIGH(P_j) \cap NEIGH(Q_h)| >$ threshold.

The above rules can be regarded as a generalization of the definition of
functional dependencies in relational databases. Furthermore, we can extend these
concepts to a method for discovering functional dependencies.
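The first two rules reduce to set operations on neighborhoods. The following Python sketch is a simplified reading, not the formulation in [5]: a neighborhood is modelled as the set of row indices that carry a given attribute value, the table `ROWS` is hypothetical, and the function names are chosen here for illustration. The strict comparison with the threshold follows the rule as printed above.

```python
# Hypothetical table: each row is (city, prefecture).
ROWS = [
    ("Ube", "Yamaguchi"),
    ("Ube", "Yamaguchi"),
    ("Hagi", "Yamaguchi"),
    ("Kobe", "Hyogo"),
]


def neighborhood(rows, column, value):
    """Set of row indices whose entry in `column` equals `value`."""
    return {i for i, row in enumerate(rows) if row[column] == value}


def is_continuous_rule(neigh_p, neigh_q):
    """Rule 1: NEIGH(P_j) is contained in NEIGH(Q_h)."""
    return neigh_p <= neigh_q


def is_softly_robust_rule(neigh_p, neigh_q, threshold):
    """Rule 2: containment plus an overlap larger than the threshold."""
    return neigh_p <= neigh_q and len(neigh_p & neigh_q) > threshold


if __name__ == "__main__":
    ube = neighborhood(ROWS, 0, "Ube")
    yamaguchi = neighborhood(ROWS, 1, "Yamaguchi")
    print(is_continuous_rule(ube, yamaguchi))        # city "Ube" -> prefecture "Yamaguchi"
    print(is_softly_robust_rule(ube, yamaguchi, 1))
```

The threshold keeps rules supported by only one or two rows from being accepted, which matters for small personal databases where accidental inclusions are common.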

4  Incorporating  Personal  Databases 

The  main  procedure  of  incorporating  personal  databases  is  described  as  follows. 

1.  Data  normalization. 

The data normalization procedure consists of the following two sub-procedures.

(a) Continuous data quantization.

(b) Word correction.

The first sub-procedure quantizes the data if a data set consists of
continuous values. The second sub-procedure corrects the data. For
word correction, we use a simply structured thesaurus. The thesaurus stores
the following Japanese-specific word correction relations.

- Conversion rules between han-kaku kana and zen-kaku kana.

- Special characters with the same meaning.

- Rules about okuri-gana.

According to these rules, different characters with the same meaning are
corrected automatically.

2. Obtaining elementary concepts and elementary neighborhoods for each
attribute on relations.

3. Detection of originally identical attributes.

If $|C^i \cap C^j| >$ threshold, the attributes i and j seem to be identical
attributes, where $C^i$ is the elementary concepts of the attribute i, and $C^j$ is
the elementary concepts of the attribute j.

4. Detection of functional dependencies.

Functional dependencies in a relation are found according to the following
procedure.

(a) For each attribute in a relation, the level of the derivable rule on the
granular structure is determined.

(b) According to the determined level, the following schema transformations
are possible.

- If the continuous inference rule is established between $C^i$ and $C^j$, a
functional dependency between attributes i and j is established.

- If the continuous inference rule is established from $C^i$ to $C^j$ and $C^k$
at the same time, the attributes j and k seem to be attributes
for which a functional dependency between attributes i and $j \cap k$ is
established.

- If the softly robust continuous inference rule is established between
$C^i$ and $C^j$, a multi-valued dependency between attributes i and j is
established. Moreover, the attribute j can be moved to another relation. If
the attribute j is decomposed into another relation, this decomposition is
an information lossless decomposition. This is evident from Definition 2
and the properties of the rules on the granular structure.
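Step 4 can be illustrated with the partition view from Section 3.2, where a functional dependency between two columns is a refinement of the corresponding partitions. The Python sketch below is one possible reading, not the author's implementation: the table `ROWS`, its attribute names, and the helper names are all hypothetical. A dependency found this way holds only for the stored instance and still needs a semantic check by the user.

```python
from collections import defaultdict

# Hypothetical personal database table.
ROWS = [
    {"emp": "e1", "dept": "sales", "floor": 1},
    {"emp": "e2", "dept": "sales", "floor": 1},
    {"emp": "e3", "dept": "dev", "floor": 2},
    {"emp": "e4", "dept": "ops", "floor": 2},
]


def elementary_sets(rows, attr):
    """Partition the row indices by the value of `attr`."""
    blocks = defaultdict(set)
    for index, row in enumerate(rows):
        blocks[row[attr]].add(index)
    return list(blocks.values())


def refines(fine, coarse):
    """True if every block of `fine` lies inside some block of `coarse`."""
    return all(any(block <= other for other in coarse) for block in fine)


def functional_dependencies(rows, attrs):
    """All pairs (x, y) such that the partition of x refines the partition of y."""
    parts = {a: elementary_sets(rows, a) for a in attrs}
    return [
        (x, y)
        for x in attrs
        for y in attrs
        if x != y and refines(parts[x], parts[y])
    ]


if __name__ == "__main__":
    for x, y in functional_dependencies(ROWS, ["emp", "dept", "floor"]):
        print(f"{x} -> {y}")
```

Comparing partitions rather than raw value pairs keeps the check close to the granular structure: each block is an elementary set, and the refinement test is the continuous inference rule applied to every block at once.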

5  Conclusion 

In this paper, a method for incorporating personal databases was proposed. We
argued that incorporating information defined by users is a core component of
knowledge management. Several kinds of deviations of data and schema were
discussed. Another type of data deviation that was not discussed in this paper is
the handling of null values. How null values are handled depends on the user. We
are also developing a more sophisticated method for handling this issue.


Abstract. Knowledge representation which is internal to a computer
lacks empirical meaning so that it is insufficient for the investigation of
the external world. All intelligent systems, including robot-discoverers,
must interact with the physical world in complex, yet purposeful and
accurate ways. We argue that operational definitions are necessary to
provide empirical meaning of concepts, but they have been largely
ignored by the research on automation of discovery and in AI. Individual
operational definitions can be viewed as algorithms that operate in the
real world. We explain why many operational definitions are needed for
each concept and how different operational definitions of the same
concept can be empirically and theoretically equivalent. We argue that all
operational definitions of the same concept must form a coherent set
and we define the meaning of coherence. No set of operational
definitions is complete so that expanding the operational definitions is one of
the key tasks in science. Among many possible expansions, only a very
special few lead to a satisfactory growth of scientific knowledge. While
our examples come from natural sciences, where the use of operational
definitions is especially clear, operational definitions are needed for all
empirical concepts. We briefly argue their role in database applications.


1  Operational  definitions  provide  empirical  meaning 

Data about the external world are obtained by observation and experiment.
Sophisticated procedures and instruments are commonly used to reach data of
scientific value. Yet we rarely think systematically about methods by which data
have been procured, until problems occur. When a set of data is inconsistent
with our expectations, we start asking: “How was this particular measurement
obtained?”, “What method has been used?”, “How is this method justified?”.
Often it turns out that a method must be changed. Because data can be wrong
in so many ways, sophisticated knowledge is required in order to examine and
improve measurement methods.

It  is  critical  to  the  growth  of  scientific  knowledge  to  study  new  situations, 
for  which  no  known  method  can  measure  a  particular  quantity.  For  instance, 
we  may  wish  to  measure  temperatures  lower  than  the  capabilities  of  all  existing 
instruments.  Or  we  want  to  measure  temperature  change  inside  a  living  cell,  as 
the cell undergoes a specific process.

When no known method applies, new methods must be discovered. New
measurement methods must expand the existing concepts. For instance, a new
thermometer must produce measurements on a publicly shared scale of
temperature.

Discovery of new measurement methods, which we also call operational
definitions, is the central problem in this paper. We provide an algorithm that
demonstrates how empirical knowledge is used to construct new operational
definitions, how new methods can be empirically verified and how choices can be
made among competing methods.

We  end  each  section  with  a  few  basic  claims  about  operational  definitions. 

Claim 1: For each empirical concept, measurements must be obtained by repeatable
methods that can be explained in detail and used in different laboratories.

Claim  2:  The  actual  verification  in  empirical  science  is  limited  to  empirical 
facts.  Operational  definitions  determine  facts;  thus  they  determine  the  scope  of 
scientific  verification. 

Claim  3:  In  contrast,  scientific  theories  often  make  claims  beyond  the  facts  that 
can  be  empirically  verified  at  a  given  time.  Theoretical  claims  often  apply  to  all 
physical  situations,  whether  we  can  observe  them  or  not. 

In  this  paper  we  use  examples  of  numerical  properties  of  objects  and  their 
pairs.  The  numbers  that  result  from  measurements,  for  instance  temperature  or 
distance,  we  call  values  of  empirical  concepts. 

Claim  4:  Operational  definitions  can  be  classified  in  several  dimensions:  (a)  they 
apply  to  objects,  states,  events,  locations  and  other  empirical  entities;  (b)  they 
may  define  predicates  of  different  arity,  for  instance,  properties  of  individual 
objects, object pairs (distance) or triples (chemical affinity); (c) some operational
definitions provide data while others prepare states that possess specific
properties,  such  as  the  triple  point  of  water. 

2  The  AI  research  has  neglected  operational  definitions 

Operational semantics links the terms used in scientific theories with direct
observations and manipulations (Bridgman, 1927; Carnap, 1936). While important
in empirical science, the mechanisms that produce high quality experiments have
been neglected not only in the existing discovery systems but in the entire
domain of artificial intelligence.

The distinction between formalism and its interpretation, also called
semantics, has been applied to the study of science since the 1920s and 1930s.
Scientific theories have been analyzed as formal systems whose language is
empirically interpreted by operational definitions.

A  similar  distinction  applies  to  discovery  systems  and  to  knowledge  they 
create.  A  discovery  mechanism  such  as  BACON  (Langley,  Simon,  Bradshaw  & 
Zytkow,  1987)  can  be  treated  as  (1)  a  formal  system  that  builds  equations  from 
data  that  are  formally  tuples  in  the  space  of  the  values  of  independent  and 
dependent variables plus (2) a mechanism that procures data.

Similarly to scientists, BACON and other discovery systems use plans to
propose experiments. Each experiment consists in selecting a list of values
$x_1, \ldots, x_k$ of empirical variables $X_1, \ldots, X_k$, and in obtaining the value y of a
dependent variable Y which provides the “world response” to the empirical
situation characterized by $x_1, \ldots, x_k$. But instead of real experiments, the values
of dependent variables are either typed by the user or computed in simulation, in
response to the list of values of independent variables.

This treatment bypasses real experimentation and measurements. Other
papers and collections that consider many components of the scientific methods
(Kulkarni & Simon, 1987; Sleeman, Stacey, Edwards & Gray, 1989; Shrager &
Langley, 1990; Valdes-Perez, 1995) neglect operational definitions of concepts.

In the wake of robotic discovery systems, operational semantics must, at the
minimum, provide realistic methods to acquire data. Zytkow, Zhu & Hussam
(1990) used a robotic mechanism which automatically conducted experiments
under the control of FAHRENHEIT. In another robotic experiment, Zytkow, Zhu
& Zembowicz (1992) used a discovery process to refine an operational
definition of mass transfer. Huang & Zytkow (1997) developed a robotic system that
repeats Galileo’s experiment with objects rolling down an inclined plane. One
operational definition controlled the robot arm so that it deposited a cylinder
on the top of an inclined plane, while another measured the time in which the
cylinder rolled to the bottom of the plane.

While  operational  semantics  must  accompany  any  formalism  that  applies  to 
the  real  world,  it  has  been  unnoticed  in  AI.  Jackson’s  claim  (1990)  is  typical: 
“a  well-defined  semantics  . . .  reveals  the  meaning  of  . . .  expressions  by  virtue  of 
their  form.”  But  this  simply  passes  on  the  same  problem  to  a  broader  formalism, 
that includes all the terms used in formal semantics. Those terms also require
real-world  interpretation  that  must  be  provided  by  operational  definitions. 

Plenty  of  further  research  must  be  conducted  to  capture  the  mechanisms  in 
which operational definitions are used in science and to make them applicable
to intelligent robots.

Claim  5:  Formal  semantics  are  insufficient  to  provide  empirical  meaning. 
Claim 6: Robotic discoverers must be equipped with operational definitions.


3  Operational  definitions  interact  with  the  real  world 

Early  analyses  of  operational  definitions  used  the  language  of  logic.  For  instance, 
a  dispositional  property  “soluble  in  water”  has  been  defined  as 

If x is in water then (x is soluble in water if and only if x dissolves)

But  a  more  adequate  account  is  algorithmic  rather  than  descriptive: 

Soluble (x)

Put x in water!

Does x dissolve?

As an algorithm, an operational definition consists of instructions that prescribe
manipulations, measurements and computations on the results of measurements.
Iteration can enforce requirements such as temperature stability, which can
be preconditions for measurements. Iteration can also be used in making
measurements. The loop exit condition, such as the equilibrium of the balance or
the coincidence of a mark on a measuring rod with a given object, triggers the
completion of a step in the measurement process.

Procedures that interpret independent and dependent variables can be
contrasted as manipulation and measurement mechanisms. Each independent
variable requires a manipulation mechanism which sets it to a specific value, while
a response value of a dependent variable is obtained by a measurement
mechanism. In this paper we focus on measurement procedures.

It happens that an instruction within procedure P does not work in a
specific situation. In those cases P cannot be used. Each procedure may fail for
many reasons. Some of these reasons may be systematic. For instance, a given
thermometer cannot measure temperatures below -40 °C because the thermometric
liquid freezes, or above a certain temperature, at which it boils. Let us denote the
range of physical situations to which P applies by $R_P$.

Often,  a  property  is  measured  indirectly.  Consider  distance  measurement 
by  sonar  or  laser.  The  time  interval  is  measured  between  the  emitted  and  the 
returned  signal.  Then  the  distance  is  calculated  as  a  product  of  time  and  velocity. 
Let $C(x)$ be the quantity measured by procedure P. When P terminates, the
returned value of C is $f(m_1, \ldots, m_k)$, where $m_1, \ldots, m_k$ are the values of different
quantities of x or the empirical situation around x, measured or generated by
instructions within P, and f is a computable function on those values.
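The algorithmic view can be made concrete with a small Python sketch of the sonar example. Everything named here is an assumption made for illustration: the class `OperationalDefinition`, the echo-time limits that stand in for the range $R_P$, and the speed of sound in air at about 20 °C. The division by two accounts for the round trip of the signal, which the text above leaves implicit.

```python
from dataclasses import dataclass
from typing import Callable, Optional

SPEED_OF_SOUND_M_PER_S = 343.0  # assumed: dry air at about 20 degrees Celsius


@dataclass
class OperationalDefinition:
    """A measurement procedure P with a limited range R_P and a function f."""

    name: str
    in_range: Callable[[float], bool]  # simplified membership test for R_P
    compute: Callable[[float], float]  # f applied to the measured quantity m_1

    def measure(self, raw_value: float) -> Optional[float]:
        """Return C(x) = f(m_1), or None when the procedure does not apply."""
        if not self.in_range(raw_value):
            return None
        return self.compute(raw_value)


# Hypothetical sonar: usable only for echo times between 1 ms and 50 ms.
sonar = OperationalDefinition(
    name="sonar distance",
    in_range=lambda echo_seconds: 0.001 <= echo_seconds <= 0.05,
    compute=lambda echo_seconds: SPEED_OF_SOUND_M_PER_S * echo_seconds / 2,
)

if __name__ == "__main__":
    print(sonar.measure(0.02))  # echo time inside the range
    print(sonar.measure(0.5))   # outside the range: the procedure cannot be used
```

Returning `None` outside the range, rather than a number, keeps the definition partial in the sense described above: the procedure simply yields no value where it does not apply.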

Claim  7:  Each  operational  definition  should  be  treated  as  an  algorithm. 

Claim  8:  The  range  of  each  procedure  P  is  limited  in  many  ways,  thus  each  is 
merely  a  partial  definition  applicable  in  the  range  Rp. 

Claim 9: An operational definition of concept C can measure different quantities
and use empirical laws to determine the value of C: $C(x) = f(m_1, \ldots, m_k)$.

Claim 10: An operational definition of a concept $C(x)$ can be represented by a
descriptive statement: “If x is in $R_P$ then $C(x) = f(m_1, \ldots, m_k)$”.

4  Each  concept  requires  many  operational  definitions 

In  everyday  situations  distance  can  be  measured  by  a  yard-stick  or  a  tape.  But 
a  triangulation  method  may  be  needed  for  objects  divided  by  a  river.  It  can  be 
extended  to  distance  measurement  from  the  Earth  to  the  Sun  and  the  Moon. 
Then,  after  we  have  measured  the  diameter  of  the  Earth  orbit  around  the  Sun, 
we  can  use  triangulation  to  measure  distances  to  many  stars. 

But there are stars for which the difference between the “winter angle” and
the “summer angle” measured on the Earth is non-measurably small, so another
method of distance measurement is needed. Cepheids are some of the stars within
the range of triangulation. They pulsate and their maximum brightness varies
according to the logarithm of periodicity. Another law, determined on Earth and
applied to stars, claims that the perceived brightness of a constant light source
diminishes with distance as $1/d^2$. This law jointly with the law for Cepheids allows
us to determine the distance to galaxies in which individual Cepheids are visible.

For such galaxies the Hubble Law was empirically discovered. It claims
proportionality between the distance and red shift in the lines of hydrogen spectrum.
The Hubble Law is used to determine the distance of the galaxies so distant that
Cepheids cannot be distinguished.

Similarly,  while  a  gas  thermometer  applies  to  a  large  range  of  states,  in  very 
low  temperatures  any  gas  freezes  or  gas  pressure  becomes  non-measurably  small. 
A  thermometer  applied  in  those  situations  measures  magnetic  susceptibility  of 
paramagnetic salts and uses the Curie-Weiss Law to compute temperature. There are
high  temperatures  in  which  no  vessel  can  hold  a  gas,  or  states  in  which  the  inertia 
of  gas  thermometer  has  unacceptable  influence  on  the  measured  temperature. 
Measurements  of  thermal  radiation  and  other  methods  can  be  used  in  such  cases. 

Claim  11:  Empirical  meaning  of  a  concept  is  defined  by  a  set  of  operational 
definitions. 

Claim  12:  Each  concrete  set  is  limited  and  new  methods  must  be  constructed 
for  objects  beyond  those  limits. 

5  Methods  should  be  linked  by  equivalence 

Consider two operational definitions $P_1$ and $P_2$ that measure the same quantity
C. When applied to the same objects their results should be empirically
equivalent within the accuracy of measurement. If $P_1$ and $P_2$ provide different
results, one or both must be adjusted until the empirical equivalence is regained.
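A minimal Python sketch of such an equivalence check follows. It rests on assumptions that are not part of the text: each procedure is summarized by a single absolute accuracy, two results count as equivalent when they differ by no more than the sum of both accuracies, and the readings are hypothetical.

```python
def empirically_equivalent(results_1, results_2, accuracy_1, accuracy_2):
    """Compare paired results of two procedures applied to the same objects.

    Two results agree when their difference does not exceed the combined
    accuracy of both procedures (an assumed criterion).
    """
    if len(results_1) != len(results_2):
        raise ValueError("both procedures must be applied to the same objects")
    tolerance = accuracy_1 + accuracy_2
    return all(abs(a - b) <= tolerance for a, b in zip(results_1, results_2))


if __name__ == "__main__":
    # Hypothetical distances in metres for the same three objects.
    tape = [10.02, 25.10, 40.00]
    triangulation = [10.10, 24.95, 40.20]
    print(empirically_equivalent(tape, triangulation, 0.05, 0.20))
    print(empirically_equivalent(tape, triangulation, 0.01, 0.05))
```

When the check fails, the text's prescription applies: collect more data and adjust one or both procedures until agreement within accuracy is restored.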

Since antiquity it has been known that triangulation provides the same
results, within the limits of measurement error, as a direct use of a measuring rod
or tape. But in addition to the empirical study of equivalence, procedures can
be compared with the use of empirical theories and the equality of their results
may be proven.

Triangulation uses a basic theorem of Euclidean geometry that theoretically
justifies the consistency of two methods: the use of a yardstick and
triangulation. To the extent to which Euclidean geometry is valid in the physical
world, whenever we make two measurements of the same distance, one using a
tape and the other using triangulation, the results are consistent.

Claim 13: Methods can differ in their accuracy and in the degree to which they
influence the measured quantity.

Claim  14:  When  two  operational  definitions  define  the  same  property  and  apply 
to  the  same  objects,  their  results  should  be  empirically  equivalent.  If  they  are 
not,  additional  data  are  collected  and  methods  are  adjusted  in  order  to  restore 
their  equivalence. 

Claim 15: When two operational definitions define the same concept $C(x)$, it is
possible to prove their equivalence. The proof consists in deducing from a verified
empirical theory that the statements that represent them are equivalent, that is,
$f_1(m_1, \ldots, m_k) = f_2(n_1, \ldots, n_l)$.

Claim 16: When the statements that represent two procedures use empirical
laws $C(x) = f_1(m_1, \ldots, m_k)$ and $C(x) = f_2(n_1, \ldots, n_l)$, theoretical
equivalence of both procedures follows from those laws.

Claim 17: The more general and better verified are the theories that justify the
equivalence of two procedures P1 and P2, the stronger are our reasons to believe
in the equivalence of P1 and P2.

Claim  18:  Proving  the  equivalence  of  two  procedures  is  desired,  because  the 
empirical  verification  of  equivalence  is  limited. 

6  Operational  definitions  of  a  concept  form  a  coherent  set 

We have considered several procedures that measure distance. But distance can
be measured in many other ways. Even the same method, when applied in
different laboratories, varies in details. How can we determine that different
measurements define the same physical concept? Procedures can be coordinated by
the requirements of empirical and theoretical equivalence in the areas of common
application. However, we must also require that each method overlaps with some
other method and, further, that any two methods are connected by a chain of
overlapping methods.

Definition: A set $\Phi = \{\phi_1, \ldots, \phi_n\}$ of operational definitions is
coherent iff for each $i, j = 1, \ldots, n$:

(1) $\phi_i$ is empirically equivalent with $\phi_j$. Notice that this condition is
trivially satisfied when the ranges of both operational definitions do not overlap;

(2) there is a sequence of definitions $\phi_{i_1}, \ldots, \phi_{i_k}$ such that
$\phi_{i_1} = \phi_i$, $\phi_{i_k} = \phi_j$, and for each $m = 2, \ldots, k$ the ranges
of $\phi_{i_m}$ and $\phi_{i_{m-1}}$ intersect.
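As an illustration only, condition (2) of the definition can be read as connectivity of a graph whose nodes are procedures and whose edges join procedures with intersecting ranges. The sketch below assumes that each range is a single closed interval on a common scale; the names `ranges_intersect` and `is_chain_connected` and the numeric ranges are hypothetical and are not taken from the paper. Condition (1), empirical equivalence, needs measurement data and is not checked here.

```python
from collections import deque
from typing import Dict, List, Tuple

Interval = Tuple[float, float]  # closed range [low, high] in which a procedure applies


def ranges_intersect(a: Interval, b: Interval) -> bool:
    """True when two closed intervals share at least one point."""
    return max(a[0], b[0]) <= min(a[1], b[1])


def is_chain_connected(ranges: Dict[str, Interval]) -> bool:
    """Check condition (2): any two procedures are linked by a chain of
    procedures in which consecutive ranges intersect (breadth-first search)."""
    names: List[str] = list(ranges)
    if not names:
        return True
    seen = {names[0]}
    queue = deque([names[0]])
    while queue:
        current = queue.popleft()
        for other in names:
            if other not in seen and ranges_intersect(ranges[current], ranges[other]):
                seen.add(other)
                queue.append(other)
    return len(seen) == len(names)


# Hypothetical ranges of applicability in log10(metres); illustrative values only.
distance_methods = {
    "rod": (-3.0, 3.0),
    "triangulation": (1.0, 17.5),
    "cepheids": (17.0, 23.5),
    "hubble_law": (23.0, 26.0),
}
```

With these made-up ranges the four methods form one chain; removing the Cepheid method would leave the Hubble Law method disconnected from the rest, which is the "missing link" situation.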

The measurements of distance in our examples form such a coherent set. Rod
measurements overlap with measurements by triangulation. Different versions
of triangulation overlap with one another. The triangulation applied to stars
overlaps with the method that uses Cepheids, which in turn overlaps with the
method that uses the Hubble Law.

Similarly, the measurements with the gas thermometer have been used to
calibrate the alcohol and mercury thermometers in their areas of joint application.
For high temperatures, measurements based on the Planck Law of black body
radiation overlap with the measurements based on gas thermometers. For very
low temperatures, the measurements based on magnetic susceptibility of
paramagnetic salts overlap with measurements that use the gas thermometer.

Claim 19: Each empirical concept should be defined by a coherent set of
operational definitions. When the coherence is missing, the discovery of a missing
link becomes a challenge.

For instance, the experiment of Millikan provided a link between the charge
of the electron and electric charges measured by macroscopic methods.

Claim 20: By examining theoretical equivalence in a coherent set $\Phi$ of
operational definitions we can demonstrate that the values measured by all
procedures in $\Phi$ are on the same scale.

Claim  21:  Operational  definitions  provide  means  to  expand  to  new  areas  the 
range  of  the  laws  they  use. 


7  Laws  can  be  used  to  form  new  operational  definitions 


Operational definitions can expand each concept in several obvious directions:
towards smaller values, larger values, and values that are more precise. But the
directions are far more numerous. Within the range of “room” temperatures,
consider the temperature inside a cell, the temperature of a state that varies
fast and must be measured every second, or the temperature on the surface of
Mars. Each of these cases requires different methods. A scientist may examine the
shift of tectonic plates by comparing distances on the order of tens of kilometers
over a period of a year, where the required accuracy is better than a millimeter.

Whenever  we  consider  expansion  of  operational  definitions  for  an  empirical 
concept  C  to  a  new  range  R,  the  situation  is  similar: 

(1) we can observe objects in R for which C cannot be measured with the
needed accuracy;

(2) some other attributes $A_1, \ldots, A_n$ of objects in R can be measured, or else
those objects would not be empirically available;

(3) some of $A_1, \ldots, A_n$ are linked to C by empirical laws or theories. We can
use one or more of those laws in a new method: measure some of $A_1, \ldots, A_n$ and
then use laws to compute the value of C.

Consider the task: determine the distance D from Earth to each galaxy in a set R,
given some measured properties $A_1, A_2, \ldots, A_n$ of the galaxies in R.
Operational definitions for $A_1, \ldots, A_n$ are available in the range R. For
instance, let $A_2$ measure the redshift of the hydrogen spectrum. Let
$D = h(A_2)$ be the Hubble Law. The new method is:

For a galaxy g, when no individual Cepheids can be distinguished:

   Measure $A_2$ of the light coming from g by a method of spectral analysis

   Compute the distance D(Earth, g) as $h(A_2(g))$
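The two steps above can be sketched as a small function. The paper leaves the form of h unspecified; the sketch assumes the usual low-redshift linear form D = c z / H0, and the value of the Hubble constant (70 km/s/Mpc) and the function name are illustrative assumptions rather than anything stated in the text.

```python
SPEED_OF_LIGHT_KM_S = 299_792.458  # km/s, exact by definition of the metre
HUBBLE_CONSTANT = 70.0             # km/s/Mpc; assumed illustrative value


def hubble_distance_mpc(redshift: float, h0: float = HUBBLE_CONSTANT) -> float:
    """Distance in megaparsecs computed from a measured redshift z.

    Uses D = c * z / H0, a linear approximation that is reasonable only
    for redshifts well below 1.
    """
    if h0 <= 0:
        raise ValueError("the Hubble constant must be positive")
    return SPEED_OF_LIGHT_KM_S * redshift / h0
```

A measured redshift of 0.01 would then correspond to roughly 43 Mpc under the assumed constant; a different calibration of H0 rescales every distance obtained this way, which is why the new procedure has to be made consistent with the Cepheid method where the two overlap.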

The same schema can yield other operational definitions that determine
distance by properties measurable in a new range, such as yearly parallax,
perceived brightness or electromagnetic spectrum.

Some laws cannot be used even though they apply to galaxies. Consider
$D = a/\sqrt{B}$ ($B$ is brightness). It applies even to the most remote sources of
light. But $B$ used in the law is the absolute brightness at the source, not the
brightness perceived by an observer. Only if we could determine the absolute
brightness could we determine the distance to galaxies by $D = a/\sqrt{B}$.

The following algorithm can be used in many applications:

Algorithm:

Input:  set of objects observed in range R
        attribute C that cannot be measured in R
        set of attributes A1,...,Ak that can be measured in R
        set {F1,...,Fp} of known operational definitions for C
        set LAWS of known empirical laws

Output: a method by which the values of C can be determined in R

Find in LAWS a law L in which C occurs
Let B1,...,Bm be the remaining attributes that occur in L
Verify that C can be computed from L and the values of B1,...,Bm
Verify that {B1,...,Bm} is a subset of {A1,...,Ak},
    that is, B1,...,Bm can be measured in at least some situations in R
Use L and B1,...,Bm to create a new procedure F for C
Make F consistent with the procedures in {F1,...,Fp}

After the first such procedure has been found, the search may continue for each
law that involves C.
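A minimal Python sketch of the search part of this algorithm follows. It assumes that a law is represented by the set of attributes it relates plus a table of solver functions, one per attribute that can be computed from the others; the `Law` class, the function `new_procedures` and the two example laws are hypothetical. The last step of the algorithm, making the new procedure consistent with the existing ones, requires calibration data and is left out.

```python
import math
from dataclasses import dataclass
from typing import Callable, Dict, List, Set, Tuple

Measurement = Dict[str, float]
Procedure = Callable[[Measurement], float]


@dataclass
class Law:
    """An empirical law over named attributes.

    solvers[x] computes attribute x from the law's other attributes; an
    attribute without a solver cannot be computed from this law.
    """
    name: str
    attributes: Set[str]
    solvers: Dict[str, Procedure]


def new_procedures(concept: str, measurable: Set[str],
                   laws: List[Law]) -> List[Tuple[str, Procedure]]:
    """Return (law name, procedure) pairs that determine `concept` where
    only the attributes in `measurable` can be measured."""
    found = []
    for law in laws:
        if concept not in law.attributes:
            continue  # C does not occur in L
        if concept not in law.solvers:
            continue  # C cannot be computed from L
        remaining = law.attributes - {concept}
        if not remaining <= measurable:
            continue  # some B_i cannot be measured in R
        found.append((law.name, law.solvers[concept]))
    return found


# Two hypothetical laws for distance D; constants and units are illustrative.
C_KM_S, H0 = 299_792.458, 70.0
LAWS = [
    Law("hubble", {"D", "z"}, {"D": lambda m: C_KM_S * m["z"] / H0}),
    Law("inverse_square", {"D", "L_abs", "F_obs"},
        {"D": lambda m: math.sqrt(m["L_abs"] / (4 * math.pi * m["F_obs"]))}),
]
```

When only the redshift `z` and the observed flux `F_obs` are measurable, the search returns the Hubble-based procedure alone; the inverse-square law is rejected because the absolute luminosity `L_abs` is not available, which mirrors the brightness example in the text.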

In set-theoretic terms, each expansion of concept C to a new range R can be
viewed as a mapping from the set of distinguishable classes of equivalence with
respect to C for objects in R to a set of possible new values of C, for instance,
the values larger than those that have been observed with the use of the previous
methods. But possible expansions are unlimited. The use of an existing law
narrows down the scope of possible concept expansions to the number of laws
for which the above algorithm succeeds. But the use of an existing law does
not merely reduce the choices, it also justifies them. Which of the many values
that can be assigned to a given state corresponds to its temperature? If laws
reveal the real properties of physical objects, then the new values which fit a law
indicate a concept expansion which has the potential to be the right choice.

Claim 22: Whenever empirical methods expand to new territories, new
discoveries follow. New procedures are instrumental to that growth.

Claim 23: Each new procedure expands the law it uses to a new range. If
procedures P1 and P2 use laws L1 and L2 respectively, and produce empirically
inconsistent results for new objects in range R, the choice of P1 will make L2
false in R.

If  a  number  of  procedures  provide  alternative  concept  expansions,  various 
selection  criteria  can  be  used,  depending  on  the  goal  of  research. 

Claim  24:  Among  two  methods,  prefer  the  one  which  has  a  broader  range,  for 
it  justifies  concept  expansion  by  a  broader  expansion  of  an  existing  law. 

Claim  25:  Among  two  methods,  prefer  the  one  which  has  a  higher  accuracy, 
since  it  provides  more  accurate  data  for  the  expansion  of  empirical  theories. 

Claim 26: Methods must and can be verified in their new area of application;
otherwise the empirical laws they apply would be mere definitions.

8 Operational definitions apply to all empirical concepts

While explicit operational definitions are rarely formed by experimental
scientists, they become necessary in autonomous robots. A robot explorer can also
benefit from mechanisms for generation of new procedures.

Operational meaning applies to databases. They are repositories of facts that
should be shared as a major resource for knowledge discovery and verification.
But data and knowledge can be useful only to those who understand their
meaning. Operational definitions describe how the values of all fields were produced.

Similarly to our science examples, operational definitions can be generated
from data and applied in different databases. Consider a regularity $L$, discovered
in a data table $D$, which provides accurate predictions of attribute $C$ from known
values of $A_1, \ldots, A_n$. $L$ can be used as a method that determines values of $C$.

Consider now another table $D_1$ that covers situations similar to $D$ but differs
in some attributes. Instead of test $C$, tests $B_1, \ldots, B_m$ are provided, which
may or may not be compatible with $C$. Suppose that a doctor who was familiar
with test $C$ at his previous workplace issues a query against $D_1$ that includes
attribute $C$, which is not in $D_1$. A regular query answering mechanism would fail,
but a mechanism that can expand the operational meaning of concepts may handle
such a query (Ras, 1997). A quest $Q$ for an operational definition of concept $C$
with the use of $B_1, \ldots, B_m$ will be sent to other databases. If an operational
definition is found, it is used to compute the values of $C$ in the doctor’s query.

ABSTRACT: In real-life data, in general, many attribute values are missing.
Therefore,  rule  induction  requires  preprocessing,  where  missing  attribute 
values  are  replaced  by  appropriate  values.  The  rule  induction  method  used  in 
our  research  is  based  on  rough  set  theory. 

In  this  paper  we  present  our  results  on  a  new  approach  to  missing  attribute 
values  called  a  closest  fit.  The  main  idea  of  the  closest  fit  is  based  on 
searching  through  the  set  of  all  cases,  considered  as  vectors  of  attribute 
values,  for  a  case  that  is  the  most  similar  to  the  given  case  with  missing 
attribute  values.  There  are  two  possible  ways  to  look  for  the  closest  case:  we 
may  restrict  our  attention  to  the  given  concept  or  to  the  set  of  all  cases. 
These  methods  are  compared  with  a  special  case  of  the  closest  fit  principle: 
replacing  missing  attribute  values  by  the  most  common  value  from  the 
concept.  All  algorithms  were  implemented  in  system  OOMIS.  Our 
experiments  were  performed  on  preterm  birth  data  sets  collected  at  the  Duke 
University  Medical  Center. 

KEYWORDS:  Missing  attribute  values,  closest  fit,  data  mining,  rule 
induction,  classification  of  unseen  cases,  system  OOMIS,  rough  set  theory. 


1  Introduction 

Recently data mining, i.e., discovering knowledge from raw data, has received a
lot of attention. Such data are, as a rule, imperfect. In this paper our main focus is on
missing attribute values, a special kind of imperfection. Another form of
imperfection is inconsistency: the data set may contain conflicting cases (examples),
having the same values of all attributes yet belonging to different concepts (classes).

Knowledge considered in this paper is expressed in the form of rules, also called
production rules. Rules are induced from given input data sets by algorithms based on
rough set theory. For each concept, lower and upper approximations are computed, as
defined in rough set theory [4, 6, 12, 13].

Often in real-life data some attribute values are missing (or unknown). There are
many approaches to handling missing attribute values [3, 5, 7]. In this paper we will
discuss an approach based on the closest fit idea. The closest fit algorithm for
missing attribute values replaces a missing attribute value by an existing value of
the same attribute in another case that resembles as much as possible the case with
the missing attribute values. In searching for the closest fit case, we need to compare
two vectors of attribute values: that of the given case with missing attribute values
and that of a searched case.

There  are  many  possible  variations  of  the  idea  of  the  closest  fit.  First,  for  a 
given  case  with  a  missing  attribute  value,  we  may  look  for  the  closest  fitting  cases 
within  the  same  concept,  as  defined  by  the  case  with  missing  attribute  value,  or  in  all 
concepts,  i.e.,  among  all  cases.  The  former  algorithm  is  called  concept  closest  fit,  the 
latter  is  called  global  closest  fit. 

Secondly,  we  may  look  at  the  closest  fitting  case  that  has  all  the  same  values, 
including  missing  attribute  values,  as  the  case  with  a  missing  attribute  value,  or  we 
may  restrict  the  search  to  cases  with  no  missing  attribute  values.  In  other  words,  the 
search  is  performed  on  cases  with  missing  attribute  values  or  among  cases  without 
missing  attribute  values. 

During the search, the entire training set is scanned and a proximity measure is
computed for each case. The case for which the proximity measure is the smallest is
the closest fitting case, which is used to determine the missing attribute values. The
proximity measure between two cases $e$ and $e'$ is the Manhattan distance between
$e$ and $e'$, i.e.,

$$\mathrm{distance}(e, e') = \sum_{i=1}^{n} \mathrm{distance}(e_i, e'_i),$$

where

$$\mathrm{distance}(e_i, e'_i) =
\begin{cases}
0 & \text{if } e_i = e'_i, \\
1 & \text{if } e_i \text{ and } e'_i \text{ are symbolic and } e_i \neq e'_i, \\
\dfrac{|e_i - e'_i|}{|a_i - b_i|} & \text{if } e_i \text{ and } e'_i \text{ are numbers and } e_i \neq e'_i,
\end{cases}$$

where $a_i$ is the maximum of values of $A_i$, $b_i$ is the minimum of values of $A_i$,
and $A_i$ is an attribute.
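The search can be sketched in Python as follows. This is a minimal illustration, not the OOMIS implementation: it reads the proximity measure as a distance to be minimized, represents a missing value as `None`, and assumes that a missing value on either side contributes the maximal per-attribute distance of 1 (the text above does not say how missing values enter the distance). All function names are hypothetical.

```python
from typing import Dict, List, Optional, Union

Value = Union[float, str, None]  # None marks a missing attribute value
Case = Dict[str, Value]


def attribute_distance(x: Value, y: Value, span: float) -> float:
    """Per-attribute distance in [0, 1]."""
    if x is None or y is None:
        return 1.0  # assumption: a missing value is maximally distant
    if x == y:
        return 0.0
    if isinstance(x, (int, float)) and isinstance(y, (int, float)):
        return abs(x - y) / span if span else 1.0
    return 1.0  # symbolic values that differ


def numeric_spans(cases: List[Case], attributes: List[str]) -> Dict[str, float]:
    """|a_i - b_i|: range of the known numeric values of each attribute."""
    spans = {}
    for a in attributes:
        nums = [c[a] for c in cases if isinstance(c[a], (int, float))]
        if nums:
            spans[a] = max(nums) - min(nums)
    return spans


def closest_fit(cases: List[Case], attributes: List[str], decision: str,
                within_concept: bool) -> List[Case]:
    """Fill missing values from the closest other case.

    within_concept=True gives the concept closest fit, False the global one.
    A value stays missing when the closest case lacks it as well.
    """
    spans = numeric_spans(cases, attributes)
    filled = [dict(c) for c in cases]
    for i, case in enumerate(cases):
        missing = [a for a in attributes if case[a] is None]
        if not missing:
            continue
        candidates = [c for j, c in enumerate(cases) if j != i and
                      (not within_concept or c[decision] == case[decision])]
        if not candidates:
            continue
        closest = min(candidates, key=lambda c: sum(
            attribute_distance(case[a], c[a], spans.get(a, 0.0))
            for a in attributes))
        for a in missing:
            filled[i][a] = closest[a]
    return filled
```

Because the closest case may itself lack the needed value, some values can remain missing after one pass, which is consistent with the reduced but non-zero counts the paper reports for the closest fit variants.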

In a special case of the closest fit algorithm, called the most common value
algorithm, instead of comparing entire vectors of attribute values, the search is reduced
to just one attribute, the attribute for which the case has a missing value. The
missing value is replaced by the most frequent value of that attribute within the
concept to which the case with the missing attribute value belongs.
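For comparison, the most common value variant reduces to a frequency count per concept. The sketch below is an illustration under the same assumptions as before (missing values represented as `None`, hypothetical function name); how ties between equally frequent values are broken is not specified in the text, so the tie-breaking here is simply whatever `Counter.most_common` returns.

```python
from collections import Counter
from typing import Dict, Hashable, List, Optional

Case = Dict[str, Optional[Hashable]]


def most_common_value_fill(cases: List[Case], attribute: str,
                           decision: str) -> List[Case]:
    """Replace missing values of `attribute` by the most frequent known
    value of that attribute within the same concept (decision value)."""
    counts: Dict[Hashable, Counter] = {}
    for c in cases:
        if c[attribute] is not None:
            counts.setdefault(c[decision], Counter())[c[attribute]] += 1
    filled = []
    for c in cases:
        c = dict(c)  # copy so that the input is left unchanged
        if c[attribute] is None and c[decision] in counts:
            c[attribute] = counts[c[decision]].most_common(1)[0][0]
        filled.append(c)
    return filled
```

A value stays missing only when no case of the same concept has a known value for the attribute.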

2  Rule  Induction  and  Classification  of  Unseen  Cases 

In our experiments we used LERS (Learning from Examples based on Rough Set
theory) for rule induction. LERS has four options for rule induction; only one, called
LEM2 [4, 6], was used for our experiments. Rules induced from the lower
approximation of the class certainly describe the class, so they are called certain. On
the other hand, rules induced from the upper approximation of the class describe the
class only possibly (or plausibly), so they are called possible [8]. Examples of other
data mining systems based on rough sets are presented in [14, 16].

For classification of unseen cases system LERS uses a modified “bucket brigade
algorithm” [2, 10]. The decision to which class a case belongs is made on the basis
of two parameters: strength and support. They are defined as follows: Strength is the
total number of cases correctly classified by the rule during training. The second
parameter, support, is defined as the sum of scores of all matching rules from the
class. The class C for which the support, i.e., the value of the following expression

$$\sum_{\text{matching rules } R \text{ describing } C} \mathrm{Strength}(R)$$

is the largest is the winner, and the case is classified as being a member of C. The
above scheme resembles non-democratic voting in which voters vote with their
strengths.

If a case is not completely matched by any rule, some classification systems use
partial matching. During partial matching, system AQ15 uses the probabilistic sum
of all measures of fit for rules [11]. Another approach to partial matching is presented
in [14]. Holland et al. [10] do not consider partial matching a viable alternative to
complete matching and thus rely on a default hierarchy instead. In LERS partial
matching does not rely on the input of the user. If complete matching is impossible,
all partially matching rules are identified. These are rules with at least one
attribute-value pair matching the corresponding attribute-value pair of a case.

For any partially matching rule R, an additional factor, called Matching_factor(R),
is computed. Matching_factor is defined as the ratio of the number of matched
attribute-value pairs of a rule with a case to the total number of attribute-value pairs
of the rule. In partial matching, the class C for which the value of the following
expression

$$\sum_{\text{partially matching rules } R \text{ describing } C} \mathrm{Matching\_factor}(R) \cdot \mathrm{Strength}(R)$$

is the largest is the winner, and the case is classified as being a member of C.

During  classification  of  unseen  (testing)  cases  with  missing  attribute  values, 
missing  attribute  values  do  not  participate  in  any  attempt  to  match  a  rule  during 
complete  or  partial  matching.  A  case  can  match  rules  using  only  actual  attribute 
values. 
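The two-stage voting scheme described in this section can be sketched as follows. This is an illustration of the description above, not the LERS code: the `Rule` class and `classify` function are hypothetical, attribute values are assumed to be already discretized symbols, and a missing attribute value in a case is represented by `None` or by the absence of the key, so that it never matches a rule condition.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Rule:
    conditions: Dict[str, str]  # attribute-value pairs of the rule
    decision: str               # class described by the rule
    strength: int               # cases correctly classified during training


def matched_pairs(rule: Rule, case: Dict[str, Optional[str]]) -> int:
    """Number of rule conditions satisfied by the case's actual values."""
    return sum(1 for a, v in rule.conditions.items() if case.get(a) == v)


def classify(case: Dict[str, Optional[str]], rules: List[Rule]) -> Optional[str]:
    """Return the class with the largest support, or None if no rule matches
    even partially."""
    support: Dict[str, float] = {}
    # Stage 1: complete matching, support = sum of strengths.
    for r in rules:
        if matched_pairs(r, case) == len(r.conditions):
            support[r.decision] = support.get(r.decision, 0.0) + r.strength
    # Stage 2: partial matching, used only when stage 1 found nothing.
    if not support:
        for r in rules:
            m = matched_pairs(r, case)
            if m > 0:
                factor = m / len(r.conditions)  # Matching_factor(R)
                support[r.decision] = (support.get(r.decision, 0.0)
                                       + factor * r.strength)
    if not support:
        return None
    return max(support, key=support.get)
```

Multiplying the strengths of the rules for one class by a constant before voting, as done later with the rule strength multiplier, only changes the `strength` values fed into this scheme.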

3  Description  of  Data  Sets  and  Experiments 

Data sets used for our experiments come from the Duke University Medical Center.
First, a large data set, with 1,229 attributes and 19,970 cases, was partitioned into two
parts: training (with 14,977 cases) and testing (with 4,993 cases). We selected two
mutually disjoint subsets of the set of all 1,229 attributes, the first containing 52
attributes and the second 54 attributes, and called the new data sets Duke-1 and
Duke-2, respectively. The Duke-1 data set contains laboratory test results. The
Duke-2 data set represents the most essential remaining attributes that, according to
experts, should be used in diagnosis of preterm birth. Both data sets were unbalanced:
only 3,103 training cases were preterm, and the remaining 11,874 were fullterm.

Table 1. Missing attribute values

          Number of missing attribute values in data sets processed by
          Global closest fit   Concept closest fit   Most common value
Duke-1    11,641               505,329               0
Duke-2    615                  1,449                 0

Similarly,  in  the  testing  data  set,  there  were  only  1,023  preterm  cases  while  the 
number  of  fullterm  cases  was  3,970. 

Both  data  sets,  Duke-1  and  Duke-2,  have  many  missing  attribute  values  (Duke-1 
has  505,329  missing  attribute  values,  i.e.,  64.9%  of  the  total  number  of  attribute 
values;  Duke-2  has  291,796  missing  attribute  values,  i.e.,  36.1%  of  the  total  number 
of  attribute  values). 

First,  missing  attribute  values  were  replaced  by  actual  values.  Both  data  sets 
were  processed  by  the  previously  described  five  algorithms  of  the  OOMIS  system: 
global  closest  fit  and  concept  closest  fit,  among  all  cases  with  and  without  missing 
attribute  values,  and  most  common  value. 

Since the number of missing attribute values in Duke-1 or Duke-2 is so large, we
were successful in using only three algorithms. The version that looks for the
closest fit among all cases without missing attribute values returned the unchanged,
original data sets. Therefore, in the sequel we will use the names global closest fit and
concept closest fit for the algorithms that search among all cases with missing attribute
values. For Duke-1 the concept closest fit algorithm was too restrictive: all missing
attribute values were unchanged, so we ignored the Duke-1 data set processed by the
concept closest fit algorithm. Moreover, the global closest fit and concept closest fit
algorithms returned data sets in which the number of missing attribute values was
only reduced. The results are presented in Table 1.

Since both closest fit options leave some missing attribute values, the most
common value option was then applied to the output files to replace all remaining
missing attribute values by actual attribute values. Thus, finally we obtained five
pre-processed data sets without any missing attribute values.

Table 2. Training data sets

                                       Global        Concept       Most
                                       closest fit   closest fit   common value
Duke-1   Number of conflicting cases   8,691         -             10,028
         Number of unique cases        6,314         -             4,994
Duke-2   Number of conflicting cases   7,839         0             8,687
         Number of unique cases        7,511         9,489         6,295

To reduce the error rate during classification we used a very special discretization.
First, in the training data set, the values of every numerical attribute were sorted.
Every value v was replaced by the interval [v, w), where w was the next larger value
after v in the sorted list. Our approach to discretization is the most cautious since, in
the training data set, we put only one attribute value in each interval. For testing data
sets, values were replaced by the corresponding intervals taken from the training data
set. It could happen that a few values fell into the same interval.
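This discretization amounts to a sorted list of cut points and a binary search. The sketch below is one possible reading: the text does not say what happens to the largest training value or to testing values below the smallest training value, so the open-ended intervals `[max, +inf)` and `(-inf, min)` used here are assumptions, as are the function names.

```python
import bisect
from typing import List, Tuple


def training_cut_points(values: List[float]) -> List[float]:
    """Distinct training values in increasing order; each starts an interval."""
    return sorted(set(values))


def to_interval(value: float, cuts: List[float]) -> Tuple[float, float]:
    """Map a value to the interval [v, w) of consecutive training values
    that contains it."""
    k = bisect.bisect_right(cuts, value)  # number of cut points <= value
    low = cuts[k - 1] if k > 0 else float("-inf")      # assumed lower tail
    high = cuts[k] if k < len(cuts) else float("inf")  # assumed upper tail
    return (low, high)
```

Every training value lands in its own interval, while several distinct testing values (for example 2.0 and 2.5 when the training values are 1.0, 2.0 and 3.0) can share one interval, as noted in the text.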

Surprisingly,  four  out  of  five  training  data  sets,  after  replacing  missing  attribute 
values  by  actual  attribute  values  and  by  applying  our  cautious  discretization,  were 
inconsistent.  The  training  data  sets  are  described  by  Table  2. 

For inconsistent training data sets only possible rule sets were used for
classification. Certain rules, as follows from [8], usually provide a greater error rate.
Rule induction was a time-consuming process. On a DEC Alpha 21164 computer,
with 512 MB of RAM and 533 MHz clock speed, rule sets were induced in elapsed real
time between 21 hours (for Duke-2 processed by the concept closest fit algorithm) and
133 hours (for Duke-2 processed by the global closest fit algorithm). Some statistics
about rule sets are presented in Table 3.

As  follows  from  Table  3,  as  a  result  of  unbalanced  data  sets,  the  average  rule 
strength  for  rules  describing  fullterm  birth  is  much  greater  than  the  corresponding  rule 
strength  for  preterm  birth.  Consequently,  the  error  rate  on  the  original  rule  sets  is  not 
a  good  indicator  of  the  quality  of  a  rule  set,  as  follows  from  [9]. 

Our basic concept is the class of preterm cases. Hence correctly predicted
preterm cases are called true-positives, incorrectly predicted preterm cases
(i.e., preterm cases predicted as fullterm) are called false-negatives, correctly
predicted fullterm cases are called true-negatives, and incorrectly predicted
fullterm cases are called false-positives.

Table 3. Rule sets

                                          Global        Concept       Most
                                          closest fit   closest fit   common value
Duke-1   Number of rules      Preterm     734           -             618
                              Fullterm    710           -             775
         Average strength     Preterm     4.87          -             8.97
         of rule set          Fullterm    39.08         -             44.73
Duke-2   Number of rules      Preterm     1,202         483           1,022
                              Fullterm    1,250         583           1,642
         Average strength     Preterm     2.71          9.69          4.60
         of rule set          Fullterm    15.8          43.99         11.37

Fig. 1. P(TP) - P(FP) versus rule strength multiplier for Duke-2 data set and most
common value method used for replacing missing attribute values

Sensitivity is the conditional probability of true-positives given actual preterm
birth, i.e., the ratio of the number of true-positives to the sum of the number of
true-positives and false-negatives. It will be denoted by P(TP), following notation from
[15]. Specificity is the conditional probability of true-negatives given fullterm birth,
i.e., the ratio of the number of true-negatives to the sum of the number of
true-negatives and false-positives. It will be denoted by P(TN). Similarly, the conditional
probability of false-negatives, given actual preterm birth, and equal to 1 - P(TP), will
be denoted by P(FN) and the conditional probability of false-positives, given actual
fullterm birth, and equal to 1 - P(TN), will be denoted by P(FP). Obviously,

Sensitivity  +  Specificity  =  P(TP)  -  P(FP)  +  1, 

so  all  conclusions  drawn  from  the  observations  of  the  sum  of  sensitivity  and 
specificity  can  be  drawn  from  observations  of  P(TP)  -  P(FP).  Another  study  of  the 
sum  of  sensitivity  and  specificity  was  presented  in  [1]. 
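These quantities are simple ratios, and a short Python check makes the identity concrete. The counts used below are taken from the paper itself: 1,023 preterm and 3,970 fullterm testing cases, and the critical counts listed fourth in Table 4 (Duke-1, most common value): 615 true-positives and 2,261 true-negatives. The function name is hypothetical, and the total error rate is taken here as the fraction of misclassified testing cases, which is the reading under which the reported figures are reproduced.

```python
from typing import Dict


def diagnostic_rates(tp: int, tn: int, n_preterm: int,
                     n_fullterm: int) -> Dict[str, float]:
    """Conditional probabilities and total error rate from counts."""
    fn = n_preterm - tp    # preterm cases predicted as fullterm
    fp = n_fullterm - tn   # fullterm cases predicted as preterm
    p_tp = tp / (tp + fn)  # sensitivity
    p_tn = tn / (tn + fp)  # specificity
    return {
        "P(TP)": p_tp,
        "P(TN)": p_tn,
        "P(FN)": 1.0 - p_tp,
        "P(FP)": 1.0 - p_tn,
        "total_error": (fp + fn) / (n_preterm + n_fullterm),
    }


r = diagnostic_rates(tp=615, tn=2261, n_preterm=1023, n_fullterm=3970)
difference = r["P(TP)"] - r["P(FP)"]     # about 0.1707
sum_of_both = r["P(TP)"] + r["P(TN)"]    # equals difference + 1
```

With these counts the difference P(TP) - P(FP) comes out at about 17.07% and the total error rate at about 42.40%, in agreement with the corresponding entries of Table 4.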

Following [9], we computed the maximum of the difference between the
conditional probabilities for true-positives given actual preterm birth and
false-positives given actual fullterm birth as a function of the rule strength multiplier for
the preterm rule set. A representative chart is presented in Fig. 1. For completeness,
a typical chart (Fig. 2) shows how the true-positive rate, the true-negative rate, and
the total error rate change as a function of the rule strength multiplier. The total error
rate is defined as the ratio of the number of incorrectly classified cases
(false-positives and false-negatives) to the total number of testing cases.

Fig. 2. Sensitivity (series 1), specificity (series 2), and total error rate (series 3)
versus rule strength multiplier for Duke-2 data set and most common value method
used for replacing missing attribute values

Again, following the idea from [9], in our experiments we increased the
strength multiplier for the rules describing preterm birth in each of the five rule sets
and observed P(TP) - P(FP). For each rule set, there exists some value of the rule
strength multiplier, called critical, for which the value of P(TP) - P(FP) jumps from
the minimal value to the maximal value. The respective values of true-positives,
true-negatives, etc., and the total error rate, are also called critical. The results are
summarized in Table 4. The total error rate corresponding to the rule strength
multiplier equal to one is called initial.

The  corresponding  values  of  P(TP)  -  P(FP)  are  presented  in  Table  4.  The  critical 
total  error  rate  from  Table  4  is  computed  as  the  total  error  rate  for  the  maximum  of 
P(TP)  -  P(FP). 

4  Conclusions 

In  our  experiments  the  only  difference  between  the  five  rule  sets  used  for 
diagnosis  of  preterm  birth  is  handling  the  missing  attribute  values.  The  maximum  of 
the  sum  of  sensitivity  and  specificity  (or  the  maximum  of  P(TP)  -  P(FP))  is  a  good 
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Table  4.  Results  of  experiments 

Global  Concept  Most 


closest  fit 

closest  fit 

common  value 

Duke-1 

Duke-2 

Duke-2 

Duke-1 

Duke-2 

Initial  total 

error  rate 

21.67 

21.93 

20.75 

22.15 

22.27 

Critical  total 

error  rate 

68.48 

64.09 

54.30 

42.40 

45.88 

Maximum  of 

P(TP)  -  P(FP) 

3.65 

5.97 

11.69 

17.07 

14.43 

Minimum  of 

P(TP)  -  P(FP) 

-15.96 

-11.28 

-5.37 

-3.52 

-2.67 

Critical  number  of 
true-positives 

882 

838 

747 

615 

639 

Critical  number 
of  true-negatives 

692 

955 

1535 

2261 

2063 

Critical  rule 
strength  multiplier 

8.548 

6.982 

6.1983 

6.1855 

3.478 

indicator of the usefulness of the rule set for diagnosis of preterm birth. It is the most
important criterion of quality of the rule set. In terms of the maximum of the sum of
sensitivity and specificity (or, equivalently, the maximum of P(TP) - P(FP)), the best
data sets were those processed by the most common value algorithm for missing attribute
values. Note that the name of the algorithm is somewhat misleading because, in our
experiments, we used this algorithm to compute the most common attribute value for
each concept separately. The next best method is the concept closest fit algorithm.
The worst results were obtained by the global closest fit.
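
The equivalence of the two criteria invoked above can be made explicit with a one-line identity. The sketch below assumes the usual definitions, namely that sensitivity is P(TP) and specificity is 1 - P(FP); these definitions are not restated in this section, so they are an assumption here. The block uses the amsmath `align*` environment.

```latex
% Assumption: sensitivity = P(TP), specificity = 1 - P(FP).
\begin{align*}
\text{sensitivity} + \text{specificity}
  &= P(TP) + \bigl(1 - P(FP)\bigr) \\
  &= 1 + \bigl(P(TP) - P(FP)\bigr)
\end{align*}
```

Under that assumption the two quantities differ only by the constant 1, so they attain their maximum (and their minimum) at the same rule strength multiplier.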

The  above  ranking  could  be  discovered  not  only  by  using  the  criterion  of  the 
maximum  of  the  sum  of  sensitivity  and  specificity  but  also  by  using  other  criteria, 
for  example,  the  minimum  of  the  sum  of  sensitivity  and  specificity,  the  number  of 
critical  true-positive  cases,  critical  false-positive  cases,  etc. 

The  initial  total  error  rate  is  a  poor  indicator  of  the  performance  of  an  algorithm 
for  handling  missing  attribute  values.  Similarly,  the  number  of  conflicting  cases  in 
the  input  data  is  a  poor  indicator. 

Finally, it can be observed that the smaller values of the minimum of P(TP) -
P(FP) correspond to the smaller values of the maximum of P(TP) - P(FP), so that
the sum of the absolute values of these two numbers is roughly constant.
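
The last observation can be checked directly against the figures in Table 4. The following is a small sketch, assuming the five columns of the table are, in order, global closest fit (Duke-1, Duke-2), concept closest fit (Duke-2) and most common value (Duke-1, Duke-2); the dictionary keys and the helper name `spread` are illustrative only.

```python
# (maximum, minimum) of P(TP) - P(FP), copied from Table 4.
# The column-to-method assignment is an assumption about the table layout.
table4 = {
    "global closest fit, Duke-1": (3.65, -15.96),
    "global closest fit, Duke-2": (5.97, -11.28),
    "concept closest fit, Duke-2": (11.69, -5.37),
    "most common value, Duke-1": (17.07, -3.52),
    "most common value, Duke-2": (14.43, -2.67),
}


def spread(max_diff, min_diff):
    """Sum of the absolute values of the maximum and the minimum."""
    return abs(max_diff) + abs(min_diff)


spreads = {name: round(spread(mx, mn), 2) for name, (mx, mn) in table4.items()}
for name, value in spreads.items():
    print(f"{name}: {value}")
```

All five sums fall between roughly 17 and 21, which is the sense in which the sum is "roughly constant".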
Abstract. This paper presents a post-processing algorithm of rule discovery
for augmenting the readability of a discovered rule set. Rule discovery,
in spite of its usefulness as a fundamental data-mining technique,
outputs a huge number of rules. Since the usefulness of a discovered rule is
judged by human inspection, augmenting the readability of a discovered
rule set is an important issue. We formalize this problem as a
transformation of a rule set into a tree structure called a visual graph. A
novel information-based criterion, which represents the compressed entropy
of a data set per description length of the graph, is employed in order to
evaluate the readability quantitatively. Experiments with an agricultural
data set in cooperation with domain experts confirmed the effectiveness
of our method in terms of readability and validity.


1  Introduction 

Knowledge  Discovery  in  Databases  (KDD)  [4]  represents  a  novel  research  area  for 
discovering  useful  knowledge  from  large-scale  data.  With  the  rapid  proliferation 
of  large-scale  databases,  increasing  attention  has  been  paid  to  KDD.  In  KDD, 
rule  discovery  [1, 7, 9]  represents  induction  of  local  constraints  in  a  data  set.  Rule 
discovery  is,  due  to  its  applicability,  one  of  the  most  fundamental  and  important 
methods  in  KDD. 

In general, a huge number of rules are discovered from a data set. In order
to evaluate the interestingness of a discovered rule set precisely, it is desirable to
decrease the number of uninteresting rules and to output the rule set in a readable
representation. However, conventional rule-discovery methods [1, 7, 9] consider
mainly the generality and accuracy of a rule, and the readability¹ of a discovered rule
set has been curiously ignored. The usefulness of a rule can only be revealed through
human inspection. Therefore, visualization of a discovered rule set is considered
to be highly important since it augments the readability of the rule set.

Rule discovery can be classified into two approaches: one is to discover strong
rules each of which explains many examples, and the other is to discover weak
rules each of which explains a small number of examples [8, 10]. This paper
belongs to the first approach, and presents a method which transforms a set
of strong rules with the same conclusion into a readable representation. As a
representation, we consider a visual graph which explains the conclusion with
premises agglomerated with respect to their frequencies. There exist methods for
discovering graph-structured knowledge, such as Bayesian networks [6] and EDAG
[5]. However, our method is different from these methods since readability is our
main goal. We propose, as a novel criterion for evaluating the readability of a visual
graph, compressed entropy density, which is given as the compressed entropy of the
data set per description length of the graph. We demonstrate the effectiveness
of our method by experiments using an agricultural data set in cooperation with
domain experts.

¹ In this paper, we define readability as simplicity and informativeness.

2  Problem  Description 

In this paper, we consider transforming a data set D and a rule set R into a visual
graph G(D, S), where S is a subset of R and represents the rule set contained
in G(D, S). We assume that the number |S| of rules in the rule set S is specified
as a threshold by the user prior to the transformation.

The data set D consists of several examples each of which is described with
a set of propositional attributes. Here, a continuous attribute is supposed to be
discretized with an existing method such as [3], and is converted to a nominal
attribute. An event representing that an attribute has one of its values is called
an atom. The proportion of examples each of which satisfies an atom a is represented
by Pr(a).

The rule set R consists of |R| rules r_1, r_2, ..., r_{|R|}, which are discovered with
an existing method [7, 9] from the data set D.

R = \{r_1, r_2, \ldots, r_{|R|}\}    (1)

In KDD, important classes of rules include association rules [1] and conjunction
rules [7, 9]. In an association rule, every attribute is assumed to be binary,
and a value in an atom is restricted to “true”. An association rule is a
rule whose premise and conclusion are each either a single atom or a conjunction
of atoms. In a conjunction rule, every attribute is assumed to be nominal. A
conjunction rule is a rule whose premise is either a single atom or a
conjunction of atoms, and whose conclusion is a single atom. In this paper, we
consider conjunction rules since they assume a more general class of attributes than
association rules. For simplification, we assume that each rule r_i has the same
conclusion x.

r_i = y_{i1} \wedge y_{i2} \wedge \cdots \wedge y_{i\nu(i)} \rightarrow x    (2)

where y_{i1}, y_{i2}, ..., y_{i\nu(i)} each represent a single atom, with mutually
different attributes.

A visual graph G(D, T) represents, in a graph format, a rule set T which
consists of |T| rules t_1, t_2, ..., t_{|T|}. As mentioned above, this rule set T is a
subset of the input rule set R. A visual graph G(D, T) is a tree structure in which
n(D, T) premise nodes b_1(D, T), b_2(D, T), ..., b_{n(D,T)}(D, T) each have an
arc to a conclusion node b_0(D, T). Here, the conclusion node b_0(D, T) represents
the atom x of conclusions in the rule set T. A premise node b_i(D, T) represents
the premises of rules each of which has the i-th most frequent atom d_i in the rule
set T. Our method constructs the premise nodes b_i(D, T) in ascending order
of i, and a rule represented in a premise node is no longer represented in the
successive premise nodes. When two or more atoms have the same number
of occurrences, the atom with the smallest subscript is selected first. Figure 1
shows an example of a rule set and its corresponding visual graph. In the visual
graph, the topmost atom x represents a conclusion node, and the other nodes
are premise nodes. In the figure, the most frequent atom u is first agglomerated
as a node, and the premises of the four rules each of which contains u represent
the premise node 1. Although three rules contain atom r, the premise node 2
represents two rules since one of the three rules is employed in the premise node
1.

Fig. 1. Example of a rule set and its corresponding visual graph (left panel: rule set;
right panel: visual graph)
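
The construction of premise nodes described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: rules are represented only by their premises (tuples of atom names), atom frequencies are counted once over the whole rule set T, ties are broken by atom name as a stand-in for "smallest subscript", and atoms whose rules have all been absorbed by earlier nodes produce no node. The function name `build_visual_graph` and the example atoms are hypothetical.

```python
from collections import Counter


def build_visual_graph(premises):
    """Group rule premises into premise nodes, most frequent atom first.

    `premises` is a list of tuples of atoms; all rules are assumed to share
    one conclusion. Returns a list of (atom, [premises]) pairs in the order
    in which the premise nodes are constructed.
    """
    # Frequency of each atom over the whole rule set (each rule counts once).
    counts = Counter(atom for premise in premises for atom in set(premise))
    # Most frequent first; ties broken by atom name (assumed tie-break rule).
    ranked = sorted(counts, key=lambda atom: (-counts[atom], atom))

    remaining = list(premises)
    nodes = []
    for atom in ranked:
        covered = [p for p in remaining if atom in p]
        if covered:  # skip atoms whose rules were already placed
            nodes.append((atom, covered))
            remaining = [p for p in remaining if atom not in p]
    return nodes


# Toy rule set mirroring the situation in the text: four rules contain u,
# three contain r, and one of those three also contains u.
rules = [("u", "v"), ("u", "r"), ("u", "w"), ("u",), ("r", "v"), ("r", "w")]
nodes = build_visual_graph(rules)
for atom, group in nodes:
    print(atom, group)
```

In this toy input the first node collects the four premises containing u, and the second node collects only the two remaining premises containing r, as in the description of Figure 1.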

While a visual graph is uniquely determined by a rule set S, there are
_{|R|}C_{|S|} ways of selecting a subset S from the rule set R. In the next section,
we describe how to choose a subset S from the rule set R in order to obtain a
visual graph G(D, S) with high readability.
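
To give a feel for the size of this search space, the number of subsets can be computed directly. The sketch below uses |R| = 333 and |S| = 15, the values that appear in the first experiment of Section 4; the variable names are illustrative.

```python
import math

n_rules = 333      # |R| in the first experiment of Section 4
n_selected = 15    # |S| specified by the user in that experiment

# Number of ways of choosing |S| rules out of |R|.
n_subsets = math.comb(n_rules, n_selected)
print(n_subsets)
```

The count is far too large for exhaustive enumeration, which motivates a heuristic search over subsets.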

3  Transformation  of  a  Rule  Set  into  a  Visual  Graph 

3.1  Compressed  Entropy  Density  Criterion 

In order to obtain a visual graph with high readability, an appropriate subset S
should be selected from the rule set R. In this paper, we consider an evaluation
criterion for the readability of a visual graph, and propose a novel method which
does not necessarily require user interaction.

The readability of a visual graph depends on two main factors. One factor is
graph complexity, which can be represented by the number of nodes and arcs in
the graph. A complex graph is considered to have low readability. For example,
a graph with 300 nodes intuitively has lower readability than
a graph with 30 nodes. The other factor is graph description-power, which can
be represented by the information content of the data set D in the graph. For
example, if a visual graph A represents a subset of a rule set represented by
another visual graph B and these two graphs have the same complexity, A has
lower readability than B.

As  explained  in  the  previous  section,  a  visual  graph  represents  a  tree  structure 
of  depth  one  in  which  each  premise  node  has  an  arc  to  a  conclusion  node.  Since 
the  atom  in  the  conclusion  node  is  fixed  and  the  depth  is  one,  visual  graphs  vary 
with  respect  to  the  atoms  in  the  premise  nodes.  Assuming  that  every  atom  has 
the  same  readability,  graph  complexity  can  be  approximated  by  the  number  of 
atoms  in  the  premise  nodes.  We  can  also  consider  the  branching  factor  of  the 
conclusion  node,  but  we  ignore  it  since  it  is  equal  to  the  number  of  premise  nodes 
and  can  be  approximately  estimated  by  the  number  of  atoms. 

In order to provide an intuitive interpretation of the evaluation criterion,
we represent graph complexity by its description length. If there are A kinds
of atoms, the description length of an atom is log_2 A bits. Therefore, the complexity
U(D, T) of a visual graph G(D, T) is given as follows.

U(D, T) = |G(D, T)| \log_2 A    (3)

where |G(D, T)| represents the number of atoms in the visual graph G(D, T).
Since A is fixed, U(D, T) is a linear function of |G(D, T)|.

Since a visual graph and a rule set have a one-to-one correspondence, the
information content of a data set D represented by a visual graph G(D, T) is
equivalent to the information content of the data set D represented by the rule
set T. The information content is calculated with respect to either the whole
rule set or each rule. In rule discovery, although readability should be
considered with respect to the whole rule set, usefulness is considered for each rule.
Therefore, we take the latter approach. We first obtain the information content
of a data set D represented by each rule in the rule set T, and then regard
their sum as the graph description-power V(D, T) for the data set D of the
visual graph G(D, T). Note that this formalization ignores dependency among
rules. We have also pursued another formalization in which premises of rules are
mutually exclusive. However, this approach has turned out to be less effective
in experiments with an agricultural data set.

In the ITRULE rule discovery system [7], Smyth employed the compressed entropy
of a data set D by a rule t: y → x as an evaluation criterion, the J-measure J(t, D)
of the rule.

J(t, D) = \Pr(y) \left[ \Pr(x|y) \log_2 \frac{\Pr(x|y)}{\Pr(x)} + \Pr(\overline{x}|y) \log_2 \frac{\Pr(\overline{x}|y)}{\Pr(\overline{x})} \right]    (4)

where \overline{x} represents the negation of the atom x. The J-measure is a single
quantitative criterion which simultaneously evaluates the generality Pr(y), the accuracy
Pr(x|y) and the unexpectedness Pr(x|y)/Pr(x) of a rule, and is reported to be
effective in rule discovery [7]. Interested readers can consult [7] for the theoretical
foundation and empirical behavior of the J-measure. In this paper, we represent
the information content of a data set D by each rule t with the J-measure J(t, D).
Therefore, the graph description-power V(D, T) for a data set D of a visual graph
G(D, T) is given as follows.

V(D, T) = \sum_{t \in T} J(t, D)    (5)
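
As an illustration of how these two quantities could be evaluated, the sketch below computes the J-measure in its standard form (as given by Smyth and Goodman for ITRULE) and sums it over a rule set. The function names and the representation of a rule as a triple of probabilities are assumptions made for this example; the convention that a term with zero conditional probability contributes zero is also an assumption.

```python
import math


def j_measure(p_y, p_x, p_x_given_y):
    """J-measure of a rule y -> x from Pr(y), Pr(x) and Pr(x|y)."""

    def term(p_cond, p_prior):
        # Convention: a term with zero conditional probability contributes 0.
        if p_cond == 0.0:
            return 0.0
        return p_cond * math.log2(p_cond / p_prior)

    return p_y * (term(p_x_given_y, p_x) + term(1.0 - p_x_given_y, 1.0 - p_x))


def description_power(rules):
    """V: sum of the J-measures of the rules, given (Pr(y), Pr(x), Pr(x|y))."""
    return sum(j_measure(p_y, p_x, p_x_given_y) for p_y, p_x, p_x_given_y in rules)


# A premise that tells nothing about x scores 0; a premise that makes a
# 50/50 conclusion certain, and holds for half of the examples, scores 0.5.
uninformative = j_measure(0.5, 0.5, 0.5)
certain = j_measure(0.5, 0.5, 1.0)
total = description_power([(0.5, 0.5, 0.5), (0.5, 0.5, 1.0)])
print(uninformative, certain, total)
```

Because the sum treats rules independently, overlapping rules are counted in full, which matches the remark above that this formalization ignores dependency among rules.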

Note that the readability of a visual graph G(D, T) decreases with respect to
graph complexity U(D, T), and increases with respect to graph description-power
V(D, T). The former is represented by the description length of the graph,
and the latter is represented by the compressed entropy of the data set by the
graph. Here, the quotient of graph description-power by graph complexity
represents the compressed entropy of the data set per description length of the graph,
and can be regarded as the density of compressed entropy. If this quotient of a graph
is large, we can regard the graph as representing information of the data set with
high density. We propose, as the evaluation criterion of the readability of a visual
graph, compressed entropy density W(D, T), which is given as follows.

W(D, T) = \frac{V(D, T)}{U(D, T)}    (6)
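
A minimal numerical sketch of this criterion, the quotient of description-power by description length, is given below. It assumes the J-measures of the rules and the number of atoms in the graph have already been obtained; the function names and the example numbers are hypothetical.

```python
import math


def complexity(n_atoms, n_atom_kinds):
    """U: description length in bits, with log2(A) bits per atom."""
    return n_atoms * math.log2(n_atom_kinds)


def compressed_entropy_density(j_values, n_atoms, n_atom_kinds):
    """W = V / U, where V is the sum of the rules' J-measures."""
    v = sum(j_values)
    u = complexity(n_atoms, n_atom_kinds)
    return v / u


# Hypothetical graph: three rules, six atoms drawn from 1024 kinds of atoms.
u_example = complexity(6, 1024)
w_example = compressed_entropy_density([0.02, 0.03, 0.01], 6, 1024)
print(u_example, w_example)
```

Adding a rule raises the numerator by its J-measure and the denominator by the atoms it adds, so a rule improves the density only when its information content per atom is above the graph's current average.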

The behavior of W(D, T) cannot be analyzed exactly since it is highly dependent
on the nature of the input data. Probabilistic analysis, based on average
performance over all possible input data sets, is too difficult to carry out directly
without invoking unrealistic assumptions concerning the nature of the inputs.
We leave more rigorous analysis of the problem for further research.


3.2  Search  Method 

Our algorithm obtains a rule set S by deleting, one by one, rules in the input
rule set R until the number of rules becomes |S|. In a KDD process, we cannot
overemphasize the importance of user interaction [2]. In rule visualization, users
may iterate the visualization procedure by inspecting the output and specifying new
conditions. Therefore, our algorithm employs hill climbing since its computation
time is relatively short. Our algorithm is given as follows.

1. (Set) T ← R

2. (Delete rules)

(a) while (|T| > |S|)

(b) T ← arg max_{T - {t}} W(D, T - {t})

3. (Return) Return G(D, T)
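
The steps above can be sketched as follows. This is an illustrative reading of the algorithm rather than the authors' code: the density function is passed in as a parameter, and the example density `make_density` makes the simplifying assumption that the number of atoms in the graph equals the total number of atoms in the premises of the selected rules. All names and numbers are hypothetical.

```python
import math


def hill_climb(rules, target_size, density):
    """Delete one rule at a time, keeping the subset with the highest density."""
    current = list(rules)                      # step 1: T <- R
    while len(current) > target_size:          # step 2(a)
        # step 2(b): try every single deletion and keep the best subset
        candidates = [current[:i] + current[i + 1:] for i in range(len(current))]
        current = max(candidates, key=density)
    return current                             # step 3: the rules to draw


def make_density(j_values, n_atom_kinds):
    """Hypothetical W: sum of J-measures over (premise atoms * log2 A)."""

    def density(subset):
        if not subset:
            return 0.0
        v = sum(j_values[rule] for rule in subset)
        n_atoms = sum(len(rule) for rule in subset)  # simplifying assumption
        return v / (n_atoms * math.log2(n_atom_kinds))

    return density


# Toy input: premises as tuples of atoms, with made-up J-measures.
j_values = {
    ("u", "v"): 0.04,
    ("u",): 0.03,
    ("r", "w", "z"): 0.03,
    ("r",): 0.001,
}
selected = hill_climb(list(j_values), 2, make_density(j_values, 16))
print(selected)
```

Each pass evaluates |T| candidate subsets, so reducing a rule set from |R| to |S| rules needs on the order of |R|^2 density evaluations, in line with the remark that the computation time is relatively short.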
4  Application to an Agriculture Data Set

In this section, we demonstrate the effectiveness of our method by applying it
to “Agriculture” data sets. “Agriculture” is a series of data sets which describes
agricultural statistics such as various crops for approximately 3200 municipalities
in Japan. We have followed suggestions of domain experts and analyzed
“semi-mountainous municipalities”. The Japanese Ministry of Agriculture specified
approximately 1700 municipalities as semi-mountainous for conservation of
agriculture in mountainous regions, and analysis of these municipalities is in high
demand. We have used the 1992 version of “Agriculture”, and there are 1748
semi-mountainous municipalities as examples in the data set.

Since Japan has diverse climates, there are many crops each of which is
cultivated in a restricted region. An atom representing the absence of such a
crop is frequent in discovered rules. However, such an atom is uninteresting
to domain experts since it represents another view of climatic conditions. In
order to ignore such atoms, we employed 148 attributes each of which has a
positive value in at least one-third of municipalities. These attributes represent,
for instance, summaries of municipalities, shipments of crops and acreages of
crops. In discretizing a continuous attribute, we first regarded “0” and missing
values as a new value, then employed the equal-frequency method [3] with three bins.
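
The discretization step described above can be sketched as follows. This is a simplified illustration, not the method of [3] itself: zero and missing values (represented here by `None`) receive their own label, and the remaining values are split at the one-third and two-thirds positions of their sorted order. The function name, the label names ("middle" in particular) and the handling of ties at cut points are assumptions.

```python
def discretize(values, labels=("low", "middle", "high")):
    """Map 0/None to "zero_or_missing" and the rest to equal-frequency bins."""
    n_bins = len(labels)
    nonzero = sorted(v for v in values if v is not None and v != 0)
    # Cut points at the k/n_bins positions of the sorted non-zero values.
    cuts = [nonzero[len(nonzero) * k // n_bins] for k in range(1, n_bins)] if nonzero else []

    def label(v):
        if v is None or v == 0:
            return "zero_or_missing"
        # The number of cut points at or below v selects the bin.
        return labels[sum(v >= c for c in cuts)]

    return [label(v) for v in values]


binned = discretize([0, None, 1, 2, 3, 4, 5, 6])
print(binned)
```

With six non-zero values the three bins receive two values each; with heavily tied data the bins can become unequal, which a production implementation would need to handle.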

According to domain experts, conditions for high income are their main
interest. First, we set the atom of the conclusion to “agricultural income per
farmhouse = high”. We obtained a rule set which consists of 333 rules with a
rule discovery method [9]. For the rule set S in the output visual graph, we
set |S| = 15. Figure 2 shows the result of this experiment.


Fig. 2. Visual graph for conclusion “agricultural income per farmhouse = high” with
|S| = 15
Atoms in this figure can be classified into four groups. The first group
represents that a considerable amount of subsidies are granted by the
administration. Atoms which belong to this group are “agriculture promotion area=true”,
“annual expenditures for agriculture per farmer=high”, “annual expenditures
per population=high” and “annual revenue per population=high”. These atoms
represent that agriculture is highly promoted by administrations, and that their
financial status is excellent. The second group represents that vegetables are
well cultivated. Atoms which belong to this group are “vegetable production
per farmer=high”, “ratio of cultivable area=high” and “ratio of forest=low”.
These atoms represent that high income is gained with vegetables, and that acreage
for vegetables is large. According to domain experts, differences in cultivation
technique of vegetables have considerable influence on income. The third group
represents that companies are highly active, and “number of companies per
population=high” belongs to this group. This atom represents that a municipality
is located close to cities, each of which gives opportunity of shipment and side
income. The fourth group represents that a municipality depends mainly on
agriculture, and “ratio of secondary industry=low” belongs to this group. This
atom represents that, for instance, each farmer has large acreage. This analysis
shows that each atom in the premise nodes in figure 2 is appropriate as a reason
for “agricultural income per farmhouse = high”.

In the next experiment, the atom in the conclusion is set to “agricultural
income per farmer = high”, and a rule set which consists of 335 rules is
obtained with the same procedure. Figure 3 shows the visual graph obtained by
our method with the same conditions.


Fig. 3. Visual graph for conclusion “agricultural income per farmer = high” with |S| =
15
In Japan, municipalities of “agricultural income per farmer = high” are
almost equivalent to municipalities of “agricultural income per farmhouse = high”.
Large-scale farmhouses are dominant in these municipalities. Since atoms in the
premise nodes in figure 3 are similar to those in figure 2, this visual graph can
be validated with a similar discussion as above.

In the last experiment, the atom in the conclusion is set to “agricultural
income per 10A = high”, and a rule set which consists of 319 rules is
obtained with the same procedure. Figure 4 shows the visual graph obtained by
our method with the same conditions.


Fig. 4. Visual graph for conclusion “agricultural income per 10A = high” with |S| = 15


Unlike the other two visual graphs, the visual graph in figure 4 has “ratio of
living area = high”, and considers “ratio of forest = low” as more important. It
should also be noted that the atoms “ratio of secondary industry=low” and “ratio of
production generation=high” have disappeared. These results can be explained by
the fact that some of the municipalities in which large-scale farmhouses are dominant
are excluded in “agricultural income per 10A = high”, and cultivation techniques
are more important for this conclusion.

In figures 2 to 4, each obtained visual graph has a simple structure and
contains valid rules. Domain experts evaluated these three results, and claimed
that each visual graph has a simple structure and thus has high readability.
They also concluded that each visual graph contains accurate and valid rules in
explaining the conclusion.

5  Conclusion 

Existing rule discovery methods induce a huge number of rules, and inspection of
these rules for judging their usefulness requires considerable human effort.
In order to circumvent this problem, we proposed a novel method for transforming
a discovered rule set into a visual graph which has a simple structure for
representing information of a data set. For this transformation, we presented a
novel criterion: compressed entropy density, which is given by the quotient of
compressed entropy by the description length of the graph. Our method has
been applied to an agricultural data set for 1748 municipalities in Japan, and
the results were evaluated by domain experts. The obtained visual graphs have high
readability and contain valid rules even for these experts. We consider that this
fact demonstrates the effectiveness of our method.
Abstract. Association rules are an important kind of knowledge extracted
from databases. However, a large number of association rules may be
extracted, and it is difficult for a user to understand them. How to select
some “representative” rules is thus an important and interesting topic. In
this paper, we propose a distance-based approach as a post-processing step
for association rules on numeric attributes. Our approach consists of
two phases. First, a heuristic algorithm is used to cluster rules based
on a matrix whose elements are the distances between pairs of rules. Second, after
clustering, we select a representative rule for each cluster based on an
objective measure. We applied our approach to a real database. As a
result, three representative rules are selected, instead of more than 300
original association rules.

Keywords: Association rules, Rule clustering, Rule selection, Numeric
attributes, Objective measures, Discretization.


1  Introduction 

Data mining has been recognized as an important area of database research.
It discovers patterns of interest or knowledge from large databases. As an
important kind of knowledge pattern, the association rule has been introduced. An
association rule is an implication expression C_1 ⇒ C_2, where C_1 and C_2 are
two conditions. It means that when the condition C_1 is true, the conclusion C_2
is almost always true.

Association rules were first introduced in Agrawal et al.’s papers [AIS93, AS94].
They considered only bucket-type data, like supermarket databases where the
set of items purchased by a single customer is recorded as a transaction. When
we focus on data in relational databases, however, we have to consider
various types of data, especially continuous numeric data. For example, (age ∈
[40, 60]) ⇒ (own-house = yes). In this case, we may find hundreds or
thousands of association rules corresponding to a specific attribute. Fig. 1 shows all
rules (about 300) that we extracted from an adult database. The rules have the
form “fnlwgt ∈ [a, b] ⇒ (income < 50K)”, where fnlwgt is a numeric attribute
and income is a decision attribute. We order the rules by the ranges in the LHS.
It is not acceptable to show all rules to users. To tackle this problem, Fukuda
et al. [FMMT96a, FMMT96b] proposed the so-called optimized association rule.
It extracts, from all candidates, a single association rule which maximizes some
index of the rules, for example, support. In many cases, however, it is just a
common sense rule and has no value at all.


Fig. 1. Many similar rules are extracted

To overcome this shortcoming, in our opinion, it is reasonable to divide the
process of discovering association rules into two steps: one is to find all
candidates whose support and confidence are greater than the thresholds given
by users; the other is to select some representative rules from all candidates.
Although most existing papers contributed to the first step, increasing
attention has been paid to the second step [KMR+94, MM95, Pre98, GB98,
Kry98a, Kry98b, WTL98]. Various measures for the interestingness of association
rules have been proposed.

In general, the evaluation of the interestingness of discovered rules has both
an objective and a subjective aspect. Klemettinen et al. [KMR+94] proposed a
simple formalism of rule templates to describe the structure of interesting rules,
such as what attributes occur in the antecedent and what attribute is the consequent.
Liu et al. [LHC97] proposed user-defined impressions to analyze discovered rules.
Other authors chose to look for objective measures for rule selection. Gago et
al. [GB98] defined a distance between two rules, and selected n rules such that they
are the most distinguished. Major et al. [MM95] proposed a set of measures,
like simplicity, novelty and statistical significance, and a stepwise selection process.
Kryszkiewicz [Kry98a, Kry98b] defined a cover operator for association rules on
bucket data, and found a minimal set of rules that covers all association rules by
the cover operator. However, since the downward closure property does not hold for
association rules on numeric attributes, the cover operator is not appropriate for
rule selection.

In this paper, we focus on the selection of association rules on numeric attributes.
We assume that a set ℛ of association rules has been extracted. We then
select a subset of ℛ as representative rules of ℛ. Our approach is first to cluster
association rules according to the distance between rules, and then to select a
representative rule for each cluster. In this paper, we also focus on objective
measures for association rules. We observe from Fig. 1 that many similar rules exist.
This is because a rule candidate which is close to a rule with high support and
confidence is most likely an association rule too. Hence, it is reasonable to
define a representative rule for a set of similar rules. Two objective measures are
proposed for clustering and selection of rules, respectively.

The  paper  is  organized  as  follows:  In  Section  2,  we  present  basic  terminology 
and  an  overview  of  the  work.  Section  3  defines  a  distance  between  rules  which 
is  used  for  grouping  similar  rules.  In  Section  4,  we  propose  a  coverage  measure 
for  selection  of  representative  rules.  In  Section  5,  we  present  some  experimental 
results.  Section  6  concludes  and  presents  our  future  work. 

2  Overview  of  Our  Work 

In this section we present basic terminology for mining association rules on
numeric attributes, and then give an overview of our approach.

Assume there is a relation D(A_1, A_2, ..., A_n, C), where A_i is an attribute
name, and C is a decision attribute. For a tuple t ∈ D, t.A_i denotes the value of
A_i at t. An association rule is an expression of the form C_1 ⇒ C_2, where C_1 and
C_2 are two expressions, called the left-hand side (LHS) and right-hand side (RHS)
of the rule, respectively. In this paper, we consider association rules on numeric
attributes with the form:

R: (a_1 < A_1 < b_1) \wedge \cdots \wedge (a_n < A_n < b_n) \Rightarrow (C = yes)

where A_i is a numeric attribute and C is a Boolean attribute. Without confusion,
we usually denote a rule by an area P in the n-dimensional space; t ∈ P means
{(a_1 < t.A_1 < b_1) ∧ ... ∧ (a_n < t.A_n < b_n)}.

Two measures, support and confidence, are commonly used to rank association
rules. The support of an association rule R, denoted by supp(R), is defined by
|{t | t ∈ P}| / |D|¹. It means how often the value of A occurs in the area P as
a fraction of the total number of tuples. The confidence of an association rule,
denoted by conf(R), is defined by |{t | t ∈ P ∧ t.C = yes}| / |{t | t ∈ P}|. It is the
strength of the rule.

¹ or |{t | t ∈ P ∧ t.C = yes}| / |D|.

For a pair of minsup and minconf specified by the user as the thresholds of
support and confidence, respectively, an association rule is called “interesting” if
both its support and confidence are over the minimal thresholds. Let ℛ denote
the set of all interesting rules. That is, ℛ = {R | supp(R) > minsup ∧ conf(R) >
minconf}. Our purpose is to extract a set of representative rules from ℛ. Our
approach consists of the following two steps:

(1)  Clustering.  We  define  a  distance  between  two  rules,  and  a  diameter  of  a  set 
of  rules  based  on  distance  of  rule  pairs.  Intuitively,  the  rules  in  Fig.  1  should 
be  clustered  into  three  groups. 

(2)  Selection.  For  each  cluster,  we  select  exactly  one  rule  as  its  representative 
rule.  We  define  a  coverage  for  each  rule.  It  measures  the  degree  of  a  certain 
rule  to  “cover”  all  others. 

In  the  following  two  sections,  we  discuss  these  two  aspects  respectively. 
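
The support and confidence measures defined in this section can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the row layout (one dictionary per tuple), the function names, and the default decision column name `"C"` are assumptions made for this sketch.

```python
def in_area(t, area):
    """True if a_i <= t.A_i <= b_i holds for every attribute range of the area P."""
    return all(lo <= t[attr] <= hi for attr, (lo, hi) in area.items())


def support(rows, area):
    """Fraction of all tuples that fall inside the area P."""
    return sum(1 for t in rows if in_area(t, area)) / len(rows)


def confidence(rows, area, decision="C"):
    """Fraction of the tuples inside P whose decision attribute is 'yes'."""
    inside = [t for t in rows if in_area(t, area)]
    if not inside:
        return 0.0  # assumption of this sketch: an empty area has confidence 0
    return sum(1 for t in inside if t[decision] == "yes") / len(inside)


def is_interesting(rows, area, minsup, minconf, decision="C"):
    """A rule is interesting if support and confidence are over the thresholds."""
    return (support(rows, area) > minsup
            and confidence(rows, area, decision) > minconf)


# Hypothetical toy relation with one numeric attribute A1.
rows = [
    {"A1": 1, "C": "yes"},
    {"A1": 2, "C": "no"},
    {"A1": 3, "C": "yes"},
    {"A1": 10, "C": "yes"},
]
area = {"A1": (1, 3)}
print(support(rows, area), confidence(rows, area))
```

Three of the four toy tuples fall in the range [1, 3], and two of those three have C = yes, so the rule has support 3/4 and confidence 2/3.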

3  Clustering  Association  Rules 

Let $\Omega = \{r_1, \ldots, r_n\}$ be a set of association rules. Each rule contains an area in its LHS. We also denote this area by $r_i$ when no confusion arises. In the following, we use the words "rule" and "area" interchangeably.

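Definition 1. Let $r_1$ and $r_2$ be two rules. The distance between $r_1$ and $r_2$ is defined by

$$dist(r_1, r_2) = \sqrt{\sum_{k=1}^{n} \left( (a_k^{(1)} - a_k^{(2)})^2 + (b_k^{(1)} - b_k^{(2)})^2 \right)} \qquad (1)$$

where $r_i = \{a_1^{(i)} \le A_1 \le b_1^{(i)}, \ldots, a_n^{(i)} \le A_n \le b_n^{(i)}\}$ for $i = 1, 2$.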

In  this  definition,  we  view  the  left  and  right  terminals  of  a  range  on  a  numeric 
attribute  as  two  independent  parameters.  Thus  a  rule  can  be  represented  as  a 
point  in  a  2n  dimension  space.  The  distance  of  two  rules  is  defined  as  the  distance 
of  the  two  points  in  the  space. 
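
Viewing a rule as a point in a $2n$-dimensional space makes the distance straightforward to compute. The sketch below assumes a rule is given as a list of `(a, b)` range pairs, one per attribute; this representation and the function name are choices made for illustration only.

```python
import math


def rule_distance(r1, r2):
    """Euclidean distance between two rules seen as points (a_1, b_1, ..., a_n, b_n)."""
    if len(r1) != len(r2):
        raise ValueError("rules must range over the same attributes")
    return math.sqrt(sum((a1 - a2) ** 2 + (b1 - b2) ** 2
                         for (a1, b1), (a2, b2) in zip(r1, r2)))


# One-attribute rules [1, 3] and [3, 5]: sqrt(2^2 + 2^2) = 2 * sqrt(2)
print(rule_distance([(1, 3)], [(3, 5)]))
```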

Definition 2. Let $C = \{r_1, \ldots, r_m\}$ be a set of rules and $r \in C$ be a rule. The (average) distance of $r$ to $C$ is defined by

$$dist(r, C) = \sum_{r_i \in C} dist(r, r_i) / m \qquad (2)$$

Definition 3. Let $C_1$ and $C_2$ be two sets of rules. The (average) distance between $C_1$ and $C_2$ is defined by

$$dist(C_1, C_2) = \sum_{r_i \in C_1, r_j \in C_2} dist(r_i, r_j) / (|C_1| \cdot |C_2|) \qquad (3)$$

where $|C_1|$ and $|C_2|$ are the numbers of rules in $C_1$ and $C_2$, respectively.

The  diameter  of  a  cluster  is  the  average  distance  of  all  pairs  of  rules  in  the 
cluster. 

Definition 4. Let $C = \{r_1, \ldots, r_m\}$ be a set of rules. The diameter of $C$ is defined by

$$d(C) = \sum_{r_i, r_j \in C} dist(r_i, r_j) / (m(m-1)) \qquad (4)$$

Definition 5. Let $C = \{C_1, \ldots, C_k\}$, where $C_i \subseteq \Omega$. $C$ is called a clustering of $\Omega$ if, for a given threshold $d_0$, the following conditions are satisfied:

1. $C_i \cap C_j = \emptyset$ $(i \ne j)$

2. $d(C_i) < d_0$

3. $dist(C_i, C_j) > d_0$ $(i \ne j)$
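
The average distances, the diameter, and the clustering conditions can be checked directly from their definitions. The following sketch assumes each rule is a tuple of `(a, b)` pairs (so that rules are hashable) and treats the diameter of a single-rule cluster as 0; both are assumptions of this illustration, and all function names are hypothetical.

```python
import math
from itertools import combinations


def rule_distance(r1, r2):
    return math.sqrt(sum((a1 - a2) ** 2 + (b1 - b2) ** 2
                         for (a1, b1), (a2, b2) in zip(r1, r2)))


def dist_to_cluster(r, cluster):
    """Average distance from rule r to the rules of a cluster."""
    return sum(rule_distance(r, ri) for ri in cluster) / len(cluster)


def cluster_distance(c1, c2):
    """Average distance over all cross pairs of two clusters."""
    total = sum(rule_distance(ri, rj) for ri in c1 for rj in c2)
    return total / (len(c1) * len(c2))


def diameter(cluster):
    """Average distance over all ordered pairs of distinct rules in a cluster."""
    m = len(cluster)
    if m < 2:
        return 0.0  # assumption: a singleton cluster has diameter 0
    total = sum(rule_distance(ri, rj)
                for i, ri in enumerate(cluster)
                for j, rj in enumerate(cluster) if i != j)
    return total / (m * (m - 1))


def is_clustering(clusters, d0):
    """Check the three conditions: disjoint, small diameters, well separated."""
    pairs = list(combinations(clusters, 2))
    disjoint = all(not (set(c1) & set(c2)) for c1, c2 in pairs)
    compact = all(diameter(c) < d0 for c in clusters)
    separated = all(cluster_distance(c1, c2) > d0 for c1, c2 in pairs)
    return disjoint and compact and separated


A = [((1, 3),), ((3, 5),), ((2, 4),)]
B = [((6, 7),)]
E = [((7, 9),)]
print(diameter(A), is_clustering([A, B, E], d0=2))
```

For the three one-attribute clusters above, the diameter of the first cluster is $(4/3)\sqrt{2}$ and all three conditions hold for $d_0 = 2$.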

This definition gives a basic requirement for clustering. Obviously, the larger the distance between clusters, the better the clustering. In other words, we expect to maximize the sum of the distances over all pairs of clusters. However, there are $O((n!)^2/2^n)$ candidate clusterings, so it is impossible to obtain an optimized clustering by a naive approach.

In this section, we propose a heuristic approach to construct a clustering. It is a hill-climbing algorithm working on a matrix in which each cell represents the distance between two rules. That is,

$$D = (dist(r_i, r_j))_{n \times n}$$

We always select the two rules (or two sets of rules) between which the distance is minimal. Hence, our algorithm consists of a loop, each iteration of which combines the two rows/columns of the matrix whose crosspoint cell has the minimal value.

When combining two rules (or two sets of rules), we have to recompute the distance between the combined cell and the other rules. The following properties can be used for this incremental recomputation. They can be derived from the definitions of diameter and distance, and from Fig. 2.

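Property 6. Let $C_1 = \{r_1, \ldots, r_m\}$ and $C_2 = \{s_1, \ldots, s_n\}$ be two sets of rules. Assume $d(C_1) = d_1$, $d(C_2) = d_2$, and $dist(C_1, C_2) = dist$. The diameter of $C_1 \cup C_2$ can be evaluated by the following formula.

$$\begin{aligned}
d(C_1 \cup C_2) &= \sum_{r, s \in C_1 \cup C_2} dist(r, s) / ((m+n)(m+n-1)) \\
&= \frac{\sum_{r, s \in C_1} dist(r, s) + \sum_{r, s \in C_2} dist(r, s) + 2 \sum_{r \in C_1, s \in C_2} dist(r, s)}{(m+n)(m+n-1)} \\
&= \frac{m(m-1)\, d(C_1) + n(n-1)\, d(C_2) + 2mn \cdot dist(C_1, C_2)}{(m+n)(m+n-1)} \\
&= \frac{m(m-1)\, d_1 + n(n-1)\, d_2 + 2mn \cdot dist}{(m+n)(m+n-1)}
\end{aligned}$$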

Property 7. Let $C_1 = \{r_1, \ldots, r_m\}$ and $C_2 = \{s_1, \ldots, s_n\}$ be two clusters, and let $C_3$ be another cluster. Assume $C_1$ and $C_2$ are combined into a new cluster $C_1 \cup C_2$. Then the distance between $C_3$ and $C_1 \cup C_2$ can be evaluated by the following formula.

$$\begin{aligned}
dist(C_3, C_1 \cup C_2) &= \Big(\sum_{r \in C_3, s \in C_1 \cup C_2} dist(r, s)\Big) / (|C_3| \cdot |C_1 \cup C_2|) \\
&= \frac{\sum_{r \in C_3, s \in C_1} dist(r, s) + \sum_{r \in C_3, s \in C_2} dist(r, s)}{|C_3| \cdot |C_1 \cup C_2|} \\
&= \frac{|C_3| \cdot |C_1| \cdot dist(C_3, C_1) + |C_3| \cdot |C_2| \cdot dist(C_3, C_2)}{|C_3| \cdot |C_1 \cup C_2|} \\
&= \frac{m d_1 + n d_2}{m + n}
\end{aligned}$$

where $d_1 = dist(C_3, C_1)$ and $d_2 = dist(C_3, C_2)$.


Fig. 2. Diameter and distance of clusters: (1) diameter of $C_1 \cup C_2$; (2) distance between $C_3$ and $C_1 \cup C_2$
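
The two incremental update formulas can be written as small helper functions. The sketch below is only an illustration of Properties 6 and 7; the function and parameter names are hypothetical, and the numbers in the usage lines are taken from one-attribute ranges [1, 3], [2, 4], and [3, 5].

```python
import math


def merged_diameter(m, n, d1, d2, dist12):
    """Diameter of C1 ∪ C2 from the sizes m, n, the diameters d1, d2,
    and the average distance dist12 between C1 and C2."""
    return (m * (m - 1) * d1 + n * (n - 1) * d2 + 2 * m * n * dist12) / (
        (m + n) * (m + n - 1))


def merged_distance(m, n, d31, d32):
    """Average distance from a third cluster C3 to C1 ∪ C2, where
    d31 = dist(C3, C1), d32 = dist(C3, C2), |C1| = m, |C2| = n."""
    return (m * d31 + n * d32) / (m + n)


s2 = math.sqrt(2)
# C1 = {[1,3], [2,4]} has diameter sqrt(2); C2 = {[3,5]} has diameter 0;
# the average distance between them is (2*sqrt(2) + sqrt(2)) / 2.
print(merged_diameter(2, 1, s2, 0.0, 1.5 * s2))  # equals (4/3) * sqrt(2)
```

Because only the cluster sizes, the old diameters, and the old average distances are needed, no pairwise rule distance has to be recomputed after a merge.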

The algorithm consists of a loop of two steps. The first step is to select the minimal distance from the upper triangle of the matrix. If the value is less than the threshold $d_0$, the corresponding two rules (clusters) are combined. The next step is to generate a new, smaller matrix.


Algorithm 8. Clustering
Input: a distance matrix $D$
Output: clustering $C = \{C_1, \ldots, C_k\}$
Method:

1. $C_i = \{r_i\}$ for $i = 1, \ldots, k$; $d = \min_{i \ne j} \{D(i, j)\}$. Assume $D(s, t)$ is the minimal distance element in $D$.

2. While ($d < d_0$) Do {

2-1. combine $C_s$ and $C_t$, and let the new $C_s$ be $C_s \cup C_t$.

2-2. delete $C_t$ from $C$.

2-3. generate a new matrix $D' = (e_{i,j})_{(n-1) \times (n-1)}$, where

$$e_{s,s} = \frac{n_s(n_s - 1)\, d_{s,s} + n_t(n_t - 1)\, d_{t,t} + 2 n_s n_t\, d_{s,t}}{(n_s + n_t)(n_s + n_t - 1)}$$

$$e_{s,j} = (n_s d_{s,j} + n_t d_{t,j}) / (n_s + n_t), \quad j \ne s, t$$

where $n_s$ and $n_t$ are the sizes of the $s$-th and $t$-th clusters, and $d_{i,j}$ is the distance between $C_i$ and $C_j$.

2-4. find the minimal distance in $D'$. Let $D'(s, t) = \min_{i \ne j} \{D'(i, j)\} = d$.

}

3. Output $C$. Assume the final matrix is $D' = (e_{i,j})$. Then the diameter of cluster $C_i$ is $e_{i,i}$.

The complexity of this algorithm is $O(n^3)$, because the most expensive step is finding the minimal element of the matrix in each iteration of the loop.
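
A compact Python sketch of the matrix-reducing loop is given below. It is an illustration under stated assumptions, not the authors' code: rules are lists of `(a, b)` pairs, clusters are returned as lists of rule indices, ties for the minimal distance are broken by the smallest row and then column index, and the names `cluster_rules` and `rule_distance` are hypothetical. The threshold 2.2 in the usage lines is also an assumption of this sketch: it is chosen between $(3/2)\sqrt{2} \approx 2.12$ and $\sqrt{5} \approx 2.24$, so that two merges go through for the five one-attribute ranges and the loop then stops.

```python
import math


def rule_distance(r1, r2):
    return math.sqrt(sum((a1 - a2) ** 2 + (b1 - b2) ** 2
                         for (a1, b1), (a2, b2) in zip(r1, r2)))


def cluster_rules(rules, d0):
    """Greedy clustering: repeatedly merge the two closest clusters while
    their average distance is below d0. Returns (clusters, diameters)."""
    k = len(rules)
    clusters = [[i] for i in range(k)]
    # Off-diagonal cells hold cluster distances, diagonal cells hold diameters.
    D = [[0.0 if i == j else rule_distance(rules[i], rules[j])
          for j in range(k)] for i in range(k)]
    while len(clusters) > 1:
        size = len(clusters)
        d, s, t = min((D[i][j], i, j)
                      for i in range(size) for j in range(i + 1, size))
        if d >= d0:
            break
        ns, nt = len(clusters[s]), len(clusters[t])
        # Diameter of the merged cluster (Property 6).
        new_diam = (ns * (ns - 1) * D[s][s] + nt * (nt - 1) * D[t][t]
                    + 2 * ns * nt * D[s][t]) / ((ns + nt) * (ns + nt - 1))
        # Distances from the merged cluster to all other clusters (Property 7).
        for j in range(size):
            if j not in (s, t):
                D[s][j] = D[j][s] = (ns * D[s][j] + nt * D[t][j]) / (ns + nt)
        D[s][s] = new_diam
        clusters[s] = clusters[s] + clusters[t]
        # Remove row and column t, shrinking the matrix by one.
        del clusters[t]
        del D[t]
        for row in D:
            del row[t]
    return clusters, [D[i][i] for i in range(len(clusters))]


rules = [[(1, 3)], [(3, 5)], [(2, 4)], [(6, 7)], [(7, 9)]]
clusters, diameters = cluster_rules(rules, d0=2.2)
print(clusters, diameters)
```

Each iteration scans the upper triangle of the current matrix for its minimum, which is the step that dominates the running time.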

Example 1. Let us consider a simple example. The rules contain only one attribute in their LHS. That is, every rule can be represented as a range in this case. Let $\Omega = \{[1,3], [3,5], [2,4], [6,7], [7,9]\}$. The distance matrix is

$$D_1 = \begin{pmatrix}
0 & 2\sqrt{2} & \sqrt{2} & \sqrt{41} & 6\sqrt{2} \\
  & 0 & \sqrt{2} & \sqrt{13} & 4\sqrt{2} \\
  &   & 0 & 5 & 5\sqrt{2} \\
  &   &   & 0 & \sqrt{5} \\
  &   &   &   & 0
\end{pmatrix}$$

Assume that the threshold $d_0 = 2$. The algorithm runs as follows.

1. Find $D_1(1,3)$, whose value is minimal in $D_1$. Since $D_1(1,3) < d_0$, we first combine the first and third rows/columns. The new matrix $D_2$ is a $4 \times 4$ one.

$$D_2 = \begin{pmatrix}
\sqrt{2} & (3/2)\sqrt{2} & (\sqrt{41} + 5)/2 & (11/2)\sqrt{2} \\
  & 0 & \sqrt{13} & 4\sqrt{2} \\
  &   & 0 & \sqrt{5} \\
  &   &   & 0
\end{pmatrix}$$

2. In the new matrix, the minimal value excluding the diagonal elements is $D_2(1,2) = (3/2)\sqrt{2} < d_0$. Hence, we combine the first and second rows/columns of $D_2$. The reduced new matrix $D_3$ is

$$D_3 = \begin{pmatrix}
(4/3)\sqrt{2} & (\sqrt{41} + 5 + \sqrt{13})/3 & 5\sqrt{2} \\
  & 0 & \sqrt{5} \\
  &   & 0
\end{pmatrix}$$

3. Finally, since the minimal-value cell $D_3(2,3) > d_0$, the algorithm stops.

$\Omega$ is thus divided into three clusters. One is $\{[1,3], [3,5], [2,4]\}$, and the others are $\{[6,7]\}$ and $\{[7,9]\}$.

4  Selecting  Representative  Rules 

The  next  phase  of  our  approach  is  to  select  a  representative  rule  for  each  cluster. 
Since  all  rules  in  the  same  cluster  are  similar,  it  is  reasonable  to  select  only  one 
as  a  representative  rule. 

Definition 9. Let $C = \{r_1, \ldots, r_n\}$ be a cluster of rules, and $R \in C$. The coverage of $R$ to $C$ is defined as

$$\alpha(R) = \Big(\sum_{r \in C} \|r \cap R\| / \|r \cup R\|\Big) / |C| \qquad (5)$$

where $\|X\|$ is the volume of the area $X$, and $r \cup R$ and $r \cap R$ are defined in the ordinary way. A rule $R$ is called the representative rule of $C$ if $\alpha(R)$ is maximal.

The measure $\alpha(R)$ reflects the degree to which one rule covers all the others. It can be used as an objective measure for selection. In the following section, we show by an example that this measure is better than others such as support.

Example 2. Let us consider Example 1 once again. For the cluster $\{[1,3], [3,5], [2,4]\}$, we obtain $\alpha([1,3]) = 4/9$, $\alpha([3,5]) = 4/9$, and $\alpha([2,4]) = 5/9$. Hence, $[2,4]$ should be selected as the representative rule of the cluster. The other two clusters are single-element clusters, so the rule itself is the representative rule of the cluster. Hence, we finally obtain a set of representative rules for $\Omega$: $\{[2,4], [6,7], [7,9]\}$.

It is easy to develop an algorithm with $O(n^2)$ complexity to select a representative rule from the cluster $C$.
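
For rules over a single attribute, the coverage measure reduces to interval arithmetic. The sketch below is a minimal illustration restricted to that one-attribute case; it assumes integer endpoints and intervals of positive length so that exact fractions can be used, and the function names are hypothetical.

```python
from fractions import Fraction


def overlap_length(r, R):
    """Length of the intersection of two closed intervals (0 if they only touch)."""
    return max(0, min(r[1], R[1]) - max(r[0], R[0]))


def union_length(r, R):
    """Total length covered by the union of two intervals."""
    return (r[1] - r[0]) + (R[1] - R[0]) - overlap_length(r, R)


def coverage(R, cluster):
    """Average ratio of intersection volume to union volume over the cluster."""
    total = sum(Fraction(overlap_length(r, R), union_length(r, R))
                for r in cluster)
    return total / len(cluster)


def representative(cluster):
    """The rule of the cluster with maximal coverage."""
    return max(cluster, key=lambda R: coverage(R, cluster))


cluster = [(1, 3), (3, 5), (2, 4)]
print([coverage(R, cluster) for R in cluster], representative(cluster))
```

The sum includes the term for $R$ itself, which always contributes 1. Computing every coverage value takes one pass over the cluster per rule, that is, a number of interval comparisons quadratic in the cluster size.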

5  Experiments 

The first experiment applies our approach to a set of association rules extracted from an adult database. The association rules have the form "$fnlwgt \in [a, b] \Rightarrow (income < 50K)$". The RHS of the rule can be viewed as a Boolean attribute. The database contains 32560 tuples. When we set minconf = 0.8 and minsup = 0.03, we obtained 310 rules.

In the first step, we represent these rules as points in a 2D space. With our algorithm, they form three clusters (Fig. 3(a)). Furthermore, three rules are selected from the three clusters, respectively. The representative rule of cluster 1 is shown in Fig. 3(b).


Fig. 3. The X-axis and Y-axis represent the left and right terminals of the range in the LHS of a rule, respectively.

The second experiment compares our coverage measure with the support measure as a selection metric. We consider another attribute, "age", in the adult database to examine the association between "age" and "income", that is, rules of the pattern "$Age \in [a, b] \Rightarrow Income < 50K$". Let the confidence threshold $\theta_c$ be 0.8. Fig. 4(a) shows the range whose support is maximal and whose confidence is greater than $\theta_c$. From the figure we can see that the selected range covers a large part in which the confidence is less than $\theta_c$. This is because the left part of the range has a confidence much higher than $\theta_c$. In contrast, Fig. 4(b) shows the range whose coverage is maximal and whose confidence and support are greater than the given thresholds.


Fig. 4. Comparison of the measure of coverage and support (axis label: Confidence)


6  Conclusions  and  Further  Work 

Selecting representative and useful association rules from all candidates is a hard problem. Although it depends by nature on the user's interests, we believe that some objective measures help users make the selection. For association rules on numeric attributes, we observed that there exist many similar rules. We thus propose a distance-based clustering algorithm to cluster them. The clustering algorithm is a heuristic hill-climbing and matrix-reducing procedure. Its complexity is $O(n^3)$, where $n$ is the number of association rules. We also propose an objective measure called coverage for selecting a representative rule for each cluster.

Some  further  work  is  needed.  How  to  deal  with  attributes  with  different  types 
and/or  scales  in  the  LHS  of  the  rules  is  interesting.  Further  evaluation  of  the 
effectiveness  of  our  approach  in  real  applications  is  also  necessary. 

Acknowledgments The authors would like to thank the anonymous reviewer who provided critical and detailed comments.

Abstract Mining for common motifs in protein tertiary structures holds the
key  to  the  understanding  of  protein  functions.  However,  due  to  the  formidable 
problem  size,  existing  techniques  for  finding  common  substructures  are 
computationally  feasible  only  under  certain  artificially  imposed  constraints, 
such  as  using  super-secondary  structures  and  fixed-length  segmentation.  This 
paper  presents  the  first,  pure  tertiary-level  algorithm  that  discovers  the  common 
protein  substructures  without  such  limitations.  Modeling  this  as  a  maximal 
common  subgraph  (MCS)  problem,  the  solution  is  found  by  further  mapping 
into  the  domain  of  maximum  clique  (MC).  Coupling  a  MC  solver  with  a  graph 
coloring  (GC)  solver,  the  iterative  algorithm,  CRP-GM,  is  developed  to  narrow 
down  towards  the  desired  solution  by  feeding  results  from  one  solver  into  the 
other.  The  solution  quality  of  CRP-GM  amply  demonstrates  its  potential  as  a 
new  and  practical  data-mining  tool  for  molecular  biologists,  as  well  as  several 
other  similar  problems  requiring  identification  of  common  substructures. 


1.  Introduction 

This  paper  describes  a  new  algorithm  capable  of  discovering  maximal  common 
substructures  from  large,  complex  graph  representations  of  given  structures  of 
interest.  The  ability  to  produce  high-quality  solutions  in  reasonable  time  has  been  a 
long  standing  challenge,  since  the  maximal  common  subgraph  (MCS)  problem  is 
known  to  be  NP-hard.  Overcoming  the  size  limitation  of  current  pattern  discovery 
techniques  based  on  conventional  graph  theory  turns  out  to  be  even  more  significant. 
Finally,  the  algorithm  is  demonstrated  to  be  not  only  a  general,  useful  data-mining 
tool  but  also  an  effective  method  for  analysis  of  protein  structure,  and  function. 

In recent years, molecular biologists have been devoting their efforts to the analysis of protein structure commonality. It is of great interest for a number of reasons. The detection of common structural patterns (or motifs) between proteins may reveal functional relationships. Moreover, the results of Jones and Thirup [1] have indicated that the three-dimensional structure of proteins can often be built from substructures of known proteins. In other words, the mining of protein motifs may in fact hold the key to the question of how proteins fold into unique and complicated 3D structures. Understanding the 'protein folding' problem will further contribute to the design of new and more effective drugs with specific 3D structures.

A number of automated techniques have been developed for this purpose. Rosmann et al. pioneered the technique of superimposing two proteins. Approaches using variations of structure representation, similarity definition, and optimization techniques have been deployed [2,3,4]. The most representative among these techniques are those of Grindley et al. [3] and Holm and Sander [4]. Grindley et al. pre-processed the protein tertiary structures into a collection of coarser representations, the secondary structures, and then performed maximal common subgraph matching on the resultant representations. Holm and Sander discarded the notion of secondary structure and instead pre-segmented the proteins into fixed-length patterns. A Monte Carlo random walk algorithm is then used to locate large common segment sets.

All  the  aforementioned  techniques  are  subject  to  artificially  imposed  constraints, 
such  as  using  super-secondary  structures  and  fixed  length  segmentation,  which  could 
damage  the  optimality  of  the  solution.  This  paper  presents  a  new  maximal  common 
sub-graph  algorithm  that  overcomes  those  limitations. 


2.  Protein  Common  Substructure  Discovery  by  MCS 

Similar 3D protein structures have similar inter-residue distances. The most often used inter-residue distance is the distance between residue centers, i.e., $C^\alpha$ atoms. By using the inter-$C^\alpha$ distance, the similarity can be measured independently of the coordinates of the atoms.

The similarity of the tertiary structures of two proteins $P_1$ and $P_2$ can be defined as

$$S = \sum_{i=1}^{M} \sum_{j=1}^{M} \phi(i, j) \qquad (1)$$

where $M$ is the number of matched $C^\alpha$ atom pairs from $P_1$ and $P_2$, and $\phi(i, j)$ is a similarity measure between the matched pairs $i$ and $j$, defined as a threshold step function that outputs 1 when $|d^{P_1}(i, j) - d^{P_2}(i, j)| \le d_{threshold}$, and 0 otherwise. This removes any contribution of unmatched residues to the overall similarity.

Definition 1: The Protein Common Tertiary Substructure (PCTS) Problem is defined as that of maximizing the similarity measure of eq. (1), seeking the maximum number of matched $C^\alpha$ atom pairs satisfying the distance measure.
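
The threshold-based similarity can be illustrated with a short Python sketch. It assumes that the matching is already given as two equally long coordinate lists, where entry $k$ of each list holds the 3D position of the $C^\alpha$ atom of matched pair $k$; the function name, the argument layout, and the example coordinates are hypothetical.

```python
import math


def similarity(coords1, coords2, d_threshold):
    """Count the pairs (i, j) of matched atoms whose inter-atom distance in
    the first protein differs from that in the second by at most d_threshold."""
    if len(coords1) != len(coords2):
        raise ValueError("both proteins must list the same matched pairs")
    M = len(coords1)
    score = 0
    for i in range(M):
        for j in range(M):
            d1 = math.dist(coords1[i], coords1[j])  # distance inside protein 1
            d2 = math.dist(coords2[i], coords2[j])  # distance inside protein 2
            if abs(d1 - d2) <= d_threshold:
                score += 1
    return score


p1 = [(0, 0, 0), (1, 0, 0), (0, 2, 0)]
p2 = [(5, 5, 5), (6, 5, 5), (5, 7, 5)]  # p1 shifted by (5, 5, 5)
print(similarity(p1, p2, d_threshold=0.5))
```

Because only distances between atoms are compared, a rigid translation of one structure leaves the score unchanged, which is the coordinate independence noted above.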

2.1  Maximal  Common  Subgraph  Approach 

In  recent  years,  graph  matching  algorithms  have  been  liberally  used  to  perform 
protein  structure  analysis  (such  as  the  work  of  Grindley  et  al.  [3]). 

Definition 2: A graph $G(V, E)$ is defined as a set of vertices (nodes), $V$, together with a set of edges, $E$, connecting pairs of vertices in $V$ ($E \subseteq V \times V$). A labeled graph is one in which labels are associated with the vertices and/or edges.

Protein structures can be easily represented as labeled graphs. For the purpose of the PCTS problem, proteins are considered labeled graphs with the vertices being the $C^\alpha$ atoms and the edges labeled with the $C^\alpha$-to-$C^\alpha$ distances between the vertices. Finding the largest common substructure between two proteins is then simply the maximal common subgraph (MCS) isomorphism problem:

Definition 3: Two graphs, $G_1$ and $G_2$, are said to be isomorphic if they have the same structure, i.e., if there is a one-to-one correspondence or match between the vertices and their (induced) edges. A common subgraph of $G_1$ and $G_2$ consists of a subgraph $H_1$ of $G_1$ and a subgraph $H_2$ of $G_2$ such that $H_1$ is isomorphic to $H_2$.

The  flexibility  allowed  by  the  similarity  measure  can  be  easily  incorporated  into 
this  graph  theoretical  approach  for  solving  the  PCTS  problem.  For  example,  the  angle 
or  bond  rigidity  in  the  protein  geometry  could  be  relaxed.  The  similarity  measure, 
then,  only  needs  to  allow  a  looser  edge  label  and  the  distance. 


2.2  Transforming  to  Maximum  Clique  Problem 

Brint  and  Willett  [5]  performed  extensive  experiments  in  the  80’s  and  concluded  that 
the  MCS  problem  can  be  solved  more  effectively  in  the  maximum  clique  domain, 
which  can  be  done  by  using  the  following  transformation. 

Definition  4:  A  clique  is  a  complete  graph.  The  Maximum  Clique  (MC)  Problem 
is  to  find  the  clique  with  the  maximum  number  of  nodes  in  a  given  graph. 

[Transforming  from  MCS  to  MC]  Barrow  et  al.  [6]  gave  a  transform  to  convert 
MCS  into  the  MC  problem  by  the  following  procedures: 

Given a pair of labeled graphs $G_1$ and $G_2$, create a correspondence graph $C$ by:

1) Create the set of all pairs of same-labeled nodes, one from each of the two graphs.

2) Form the graph $C$ whose nodes are the pairs from (1). Connect any two node pairs $N_1(A_i, B_x)$, $N_2(A_j, B_y)$ in $C$ if the labels of the edges from $A_i$ to $A_j$ in $G_1$ and $B_x$ to $B_y$ in $G_2$ are the same.

Solving the MCS problem then becomes that of finding the maximum clique of $C$ and mapping the solution back into an MCS solution by the inverse transformation.
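
The two construction steps can be sketched as follows. This is a hedged illustration, not the solver described in this paper: graphs are assumed to be given as a node-label dictionary plus an edge-label dictionary keyed by `frozenset({u, v})`, two node pairs that share a node are never connected (so that a clique maps back to a one-to-one match), and two absent edges are treated as matching. The exhaustive `max_clique` helper is only meant for tiny inputs, and all names are hypothetical.

```python
from itertools import combinations


def correspondence_graph(labels1, edges1, labels2, edges2):
    """Adjacency sets of the correspondence graph of two labeled graphs."""
    nodes = [(a, b) for a in labels1 for b in labels2
             if labels1[a] == labels2[b]]
    adj = {node: set() for node in nodes}
    for (a1, b1), (a2, b2) in combinations(nodes, 2):
        if a1 == a2 or b1 == b2:
            continue  # assumption: keep the vertex match one-to-one
        e1 = edges1.get(frozenset((a1, a2)))
        e2 = edges2.get(frozenset((b1, b2)))
        if e1 == e2:  # same edge label (or both edges absent)
            adj[(a1, b1)].add((a2, b2))
            adj[(a2, b2)].add((a1, b1))
    return adj


def max_clique(adj):
    """Exhaustive maximum clique search, suitable for very small graphs only."""
    best = []

    def expand(clique, candidates):
        nonlocal best
        if len(clique) > len(best):
            best = clique[:]
        for v in list(candidates):
            expand(clique + [v], candidates & adj[v])
            candidates = candidates - {v}

    expand([], set(adj))
    return best


labels1 = {"a": "X", "b": "X", "c": "X"}
edges1 = {frozenset("ab"): 1, frozenset("bc"): 2, frozenset("ac"): 3}
labels2 = {"p": "X", "q": "X", "r": "X"}
edges2 = {frozenset("pq"): 1, frozenset("qr"): 2, frozenset("pr"): 9}
adj = correspondence_graph(labels1, edges1, labels2, edges2)
print(max_clique(adj))  # each node pair (A, B) matches A in G1 with B in G2
```

Reading the clique back as a list of matched node pairs is the inverse transformation: the first components induce the common subgraph in $G_1$, the second components the one in $G_2$.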


3.  Algorithms 


3.1  Exploiting  the  Relations  Between  MC  and  Graph  Coloring 

Both  problems  of  maximal  common  subgraph  and  maximum  clique  are  NP-hard. 
Numerous  MC  algorithms  have  been  developed  over  the  years.  However,  their 
solution  quality  tends  to  vary  significantly  from  test  case  to  test  case,  mainly  because 
they  are  mostly  heuristic  algorithms  trying  to  solve  a  multi-dimensional  optimization 
problem  with  local  optima  “traps”.  Another  NP-hard  problem  of  graph  coloring  (GC) 
is  tightly  coupled  with  MC  in  an  iterative  loop  aiming  to  converge  to  the  optimal 
solution  of  either  problem,  or  in  many  cases  both.  In  this  section,  only  the  most 
relevant  parts  to  the  MC-GC  solver  are  included,  leaving  other  details  in  [7].  The 
algorithmic  framework  of  the  MC-GC  solver  is  shown  in  Figure  1. 

Fig. 1. Algorithmic Framework for CRP-MCS

Given a graph $G(V, E)$, the relation between MC and GC is fundamentally expressed by the following well-known theorem:

Theorem 1: Given the size of the maximum clique, $\omega(G)$, and the chromatic number, $\chi(G)$, then $\omega(G) \le \chi(G) \le (\Delta + 1)$, where $\Delta$ is the maximum degree of $G$. [8]

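Because the maximum clique size bounds the chromatic number from below, and the number of colors used by any proper coloring bounds it from above, the two provide a perfect termination condition for the loop process shown in Figure 1: when a clique is found whose size equals the number of colors used by a coloring, the optimal solutions to both problems have been found simultaneously.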
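
The termination test implied by Theorem 1 can be sketched in Python as follows. The plain greedy coloring used here is only a stand-in to produce some proper coloring, not the coloring procedure of this paper; the graph is assumed to be a dictionary of adjacency sets, and all names are hypothetical.

```python
def greedy_coloring(adj):
    """Color vertices in sorted order with the smallest color not used nearby."""
    coloring = {}
    for v in sorted(adj):
        used = {coloring[u] for u in adj[v] if u in coloring}
        c = 1
        while c in used:
            c += 1
        coloring[v] = c
    return coloring


def is_clique(adj, nodes):
    nodes = list(nodes)
    return all(nodes[j] in adj[nodes[i]]
               for i in range(len(nodes)) for j in range(i + 1, len(nodes)))


def is_proper(adj, coloring):
    """True if no edge joins two vertices of the same color (complete coloring)."""
    return all(coloring[u] != coloring[v] for v in adj for u in adj[v])


def both_optimal(adj, clique, coloring):
    """A clique as large as the number of colors proves both are optimal."""
    return (is_clique(adj, clique) and is_proper(adj, coloring)
            and len(set(clique)) == len(set(coloring.values())))


# Triangle 0-1-2 with a pendant vertex 3 attached to vertex 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
coloring = greedy_coloring(adj)
print(coloring, both_optimal(adj, [0, 1, 2], coloring))
```

When the test fails, neither solution is shown to be optimal yet, and the loop continues with another round of clique finding or coloring.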

In  order  to  devise  a  set  of  heuristics  for  clique-finding  and  graph  coloring,  the 
following  definitions  and  theorems  are  utilized. 

Definition 5: Given a coloring for $G$, the color-degree of vertex $v_i$, $cdeg(v_i)$, is defined as the number of different colors of its adjacent nodes; the color-vector of vertex $v_i$, $cv(v_i)$, is defined as the set of colors that $v_i$ can use without conflicting with the colors assigned to its adjacent nodes.
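
Both quantities follow directly from a (possibly partial) coloring. The sketch below assumes the graph is a dictionary of adjacency sets, the coloring is a dictionary from colored vertices to colors numbered from 1, and the palette size is passed explicitly; these conventions and the function names are illustrative assumptions.

```python
def color_degree(adj, coloring, v):
    """Number of different colors used by the colored neighbors of v."""
    return len({coloring[u] for u in adj[v] if u in coloring})


def color_vector(adj, coloring, v, num_colors):
    """Colors from 1..num_colors that v can use without conflicting
    with the colors already assigned to its neighbors."""
    taken = {coloring[u] for u in adj[v] if u in coloring}
    return set(range(1, num_colors + 1)) - taken


# Triangle 0-1-2 with a pendant vertex 3 attached to vertex 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
partial = {0: 1, 1: 2}
print(color_degree(adj, partial, 2), color_vector(adj, partial, 2, 3))
```

With vertices 0 and 1 colored 1 and 2, vertex 2 sees two different colors and has only color 3 left in a three-color palette.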

Lemma 1: For any set $V'$ of vertices, let the size of the maximum clique that includes $V'$ be $\omega(G|V')$. Then $\omega(G|V') \le (cdeg(V') + |V'|)$. (Proof omitted here.)

Definition 6: A set $D \subset V$ is defined to be dominant if $\forall v \in (V \setminus D), \exists u \in D$ such that $(u, v) \in E$. Given a complete coloring $C$ for $G$, for any color $c$, $S_c = \{v \mid c \in cv(v)\}$ forms a dominant set of vertices. The color that corresponds to the smallest $S_c$ is called the (minimal) dominant color.

Assuming that the graph has a uniform probability of edge connection, the probability that a vertex in any dominant set belongs to a maximum clique can be derived as follows.

Theorem 2: Given a random graph $G(V, E)$ of size $n$, in which the edge probability for each pair of vertices is $e(u, v) = p$ and the size of a specific maximum clique is $\omega$, and a complete coloring $C$ for $G$ whose minimal dominant vertex set is $S_c$, then $\forall v \in S_c$, the probability that $v$ belongs to a clique of size $\omega$ is


ASc\ 


('1 


Therefore, selecting a vertex from the smallest dominant vertex set gives it a higher probability of being in the maximum clique. This underlies the strategy of using a GC solution as an initializing "seed" for the MC computing process.

Definition 7: When coloring a graph, the color reduction of node $v_i$ is defined as the process of removing from $cv(v_i)$ the colors in conflict with the colors of all of its neighbors.

Graph  coloring  is  generally  accomplished  by  sequentially  assigning  colors  to 
uncolored  vertices.  The  risk  of  such  sequential  decision  process  is  that  once  a  vertex 
is  colored  with  color  c  when  there  is  more  than  one  choice,  due  to  color  reduction,  the 
adjacent  vertices  are  forced  to  use  the  other  available  colors.  Consequently,  the 
coloring  solution  could  be  misdirected  away  from  the  optimal  due  to  premature  color 
decisions.  The  color  reduction  process  is  used  in  this  work  precisely  to  prevent 
premature  commitments  in  the  effort  of  achieving  minimal  coloring. 
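
One pass of color reduction over all uncolored nodes can be sketched as below. The sketch assumes the color vectors are kept in a dictionary of sets that is updated in place, and it additionally reports the nodes left without any usable color; the data layout and the function name are assumptions made for this illustration.

```python
def color_reduction(adj, cv, coloring):
    """Remove from each uncolored node's color vector every color already
    used by one of its neighbors. Returns the nodes left with no color."""
    exhausted = []
    for v in adj:
        if v in coloring:
            continue
        cv[v] -= {coloring[u] for u in adj[v] if u in coloring}
        if not cv[v]:
            exhausted.append(v)
    return exhausted


# Triangle 0-1-2 with a pendant vertex 3, and a palette of only two colors.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
coloring = {0: 1, 1: 2}
cv = {2: {1, 2}, 3: {1, 2}}
print(color_reduction(adj, cv, coloring), cv)
```

Here vertex 2 loses both colors because its two colored neighbors already use them, which signals that two colors cannot suffice, while vertex 3 keeps its full color vector and no decision about it is forced yet.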

Definition 8: A Saturated Clique (SC) is defined as a clique $cl$ whose size is equal to the size of the union of all its node color vectors, i.e., $|\bigcup_{v \in cl} cv(v)| = |cl|$.


3.2  Solving  MCS  via  MC  (and  GC) 

Based on the close relations observed between the graph coloring and maximum clique problems, a complementary algorithm, CRP-MCS, that combines graph coloring and clique-finding algorithms is designed to solve the maximum clique problem. A resource management methodology [9], called Constrained Resource Planning (CRP), provides the guiding principles and motivates the solution strategies for both the coloring and clique-finding processes of the iterative loop. The solution from one process, together with its derived information, is used to initialize the counterpart process, and the two execute alternately until a solution is found. Such an initialization process is called "seeding" in this work. Each sub-algorithm terminates upon completion of its targeted goal and then hands over the result to the other. The entire iterative process terminates when certain criteria are met, and the maximal clique solution is transformed into an MCS solution.


3.2.1  Clique-Finding  Algorithm 

Each graph coloring process produces a different color distribution. Since our clique-finding algorithm relies on the coloring information, the coloring result $C$ from the previous coloring process naturally becomes the seed for clique-finding. The set of nodes that use the dominant color is taken as the seed, or pivot vertices, for the clique-finding process, and large cliques are sought in $Nbr(v)$ for each pivot $v$.

In addition, for any clique in the graph, each color contributes at most one vertex. Moreover, once a vertex is chosen to be added to a temporary clique, vertices that are not connected to it have to be disregarded, which may result in some colors being disregarded without contributing any vertex. Therefore, it is highly desirable to preserve as many colors as possible during the search for a clique. Similar to the principle of selecting the pivot vertices above, the color that contains the fewest vertices is chosen. Then, within the selected color, the vertex $v$ that has the highest color degree is selected and added to the temporary clique.

The  clique-finding  algorithm  is  summarized  as  follows. 


Algorithm CLIQUE-FINDING (input: coloring C, largest clique found cl^0)

1: Let B_U = UpperBound(G, C). If |cl^0| = B_U, terminate.

2: Locate dominant color c, set pivot node set
   P = {v | cv(v) ≥ (B_U − 1), v ∈ color c', |c'| = |c|}

3: For each p in P,
   Set G' = {v | v ∈ Nbr(p)}, cl = NULL, set tmp-cl = [p]
   While |tmp-cl| > 0,
      bcdeg = MaxCDEG(G')
      select c that MIN |{v | v ∈ G', cv(v) = c ∧ cdeg(v) ≥ (bcdeg − 1)}|
         If ties, select c that MAX Σ_{v, cv(v)=c} deg(v)
      Pick node v from c that MAX(cdeg(v))
         If ties, pick v that MAX |{e(u,w) | u,w ∈ Nbr(v)}|
      If InHash([v, tmp-cl]), select another node
      Set tmp-cl = [v, tmp-cl], add tmp-cl into HASH
      Set G' = G' ∩ Nbr(v)
      If |G'| = 0, Call BackTrack()
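
The core selection rule of the clique-finding step (fewest-vertex color first, then highest color degree) can be sketched in Python. This is a deliberately simplified illustration and not the CLIQUE-FINDING algorithm itself: it omits the upper-bound test, the hash memory, and backtracking, and it uses the smallest color class of a complete coloring as a stand-in for the dominant color. All names are hypothetical.

```python
from collections import defaultdict


def color_degree(adj, coloring, v):
    return len({coloring[u] for u in adj[v]})


def greedy_clique_from_coloring(adj, coloring):
    """Grow one clique per pivot vertex and return the largest one found."""
    classes = defaultdict(list)
    for v, c in coloring.items():
        classes[c].append(v)
    # Assumption: the smallest color class supplies the pivot vertices.
    pivot_color = min(classes, key=lambda c: (len(classes[c]), c))
    best = []
    for p in sorted(classes[pivot_color]):
        clique, candidates = [p], set(adj[p])
        while candidates:
            by_color = defaultdict(list)
            for v in candidates:
                by_color[coloring[v]].append(v)
            # Prefer the color with the fewest remaining candidate vertices.
            c = min(by_color, key=lambda col: (len(by_color[col]), col))
            # Within that color, take the vertex with the highest color degree.
            v = max(sorted(by_color[c]),
                    key=lambda u: color_degree(adj, coloring, u))
            clique.append(v)
            candidates &= adj[v]  # keep only vertices adjacent to all chosen
        if len(clique) > len(best):
            best = clique
    return best


# Triangle 0-1-2 with a pendant vertex 3 attached to vertex 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
coloring = {0: 1, 1: 2, 2: 3, 3: 1}
print(sorted(greedy_clique_from_coloring(adj, coloring)))
```

Restricting the candidate set to the neighbors of every chosen vertex guarantees that the result is a clique, but without backtracking it need not be a maximum one.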

3.2.2  Coloring  Algorithm 

As discussed earlier, color reduction (CR) plays an active role in the GC algorithm. It not only helps to reduce the solution space by removing conflicting colors from the color vectors, but also helps to reach the chromatic number and to decide the convergence of the solution.

Theorem 4: [Coloring Lower Bound $B_L$] Given a graph $G$, assume that a coloring initializes with $k$ colors. If, while performing pure color reduction, there is any (1) node with a zero-size $cv$, or (2) clique $cl$ with $m = |\bigcup_{i=1}^{|cl|} cv(v_i)| < |cl|$ $(v_i \in cl)$, then $G$ needs at least $(k + 1)$ colors ($k + |cl| - m$ if (2) is the case).

Moreover, since $\omega(G) \le \chi(G)$, the lower bound for any graph coloring process is the maximum of the $B_L$ derived from the previous coloring process and the size of the largest clique found in the earlier clique-finding process.

Because the largest clique found in earlier clique-finding processes may in fact be the new lower bound for graph coloring, it is treated as the 'seed' for a new coloring process. Specifically, for the largest clique $cl$ with size $k$, each vertex in $cl$ is assigned a unique color from 1 through $k$.

In order to perform color reduction using the concept of saturated cliques (SC), a set of cliques needs to be identified. Since the proposed algorithm is an iterative process, all the cliques found by the clique-finding process and stored in the hash can be utilized. The more cliques are collected, the higher the chance of finding more SCs for color reduction, which postpones unnecessary forced coloring. This could lead to the use of fewer colors for coloring the entire graph. A supplementary algorithm designed for this purpose is described in [11].

For any state of the coloring, vertices that have smaller color vectors tend to have less chance of being assigned a color. In order to avoid using too many colors, it is critical to process these vertices as early as possible. Therefore, these vertices are regarded as the most constrained tasks to be accomplished. Meanwhile, although there are already fewer choices for coloring these vertices than for the others, careless color assignment would still result in the overuse of colors. Thus, each color from the color vector needs to be examined to determine which would have the least impact on the rest. The impact is evaluated by the reduction of the color vector sizes of the uncolored neighbor set, and the color with the least impact is assigned to the vertex.

The  complete  algorithm  is  shown  below. 

Algorithm COLORING (input: graph G, hash memory: HASH)

1. Set Bl = MAX(Bl, max(HASH))

2. ∀v ∈ G, assign v a color vector cv(v) = {1 ... Bl}

3. Color the largest untried clique cl in HASH with 1 ~ |cl|

4. Perform color reduction, and update Bl by Theorem 4

5. while there is a node uncolored:

   Select a vertex v with MIN(|cv(v)|)

   If ties, select v that MIN Σ_{c ∈ cv(v)} |{u | u ∈ Nbr(v) ∧ c ∈ cv(u)}|

   Select c ∈ cv(v) that MIN |{u | u ∈ Nbr(v) ∧ c ∈ cv(u)}|

   If ties, select c by |cv(v)| after CR

   Perform color reduction.
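
The selection rules of step 5 can be illustrated with a small Python sketch. It is only a simplified illustration under stated assumptions: the graph is given as adjacency sets, the function name `greedy_color` is hypothetical, and the clique seeding, the hash memory and the saturated-clique color reduction of the full algorithm are left out (the "reduction" here merely removes an assigned color from the color vectors of uncolored neighbors).

```python
def greedy_color(adj, num_colors):
    """Most-constrained-vertex / least-impact-color heuristic (simplified).

    adj: dict mapping each vertex to the set of its neighbours.
    num_colors: initial upper bound on the number of colors.
    Returns dict vertex -> color, or None if a vertex runs out of colors.
    """
    cv = {v: set(range(1, num_colors + 1)) for v in adj}  # color vectors
    coloring = {}

    def impact(v, c):
        # Number of uncolored neighbours of v that would lose color c.
        return sum(1 for u in adj[v] if u not in coloring and c in cv[u])

    while len(coloring) < len(adj):
        uncolored = [v for v in adj if v not in coloring]
        # Smallest color vector first; ties broken by the summed impact.
        v = min(uncolored,
                key=lambda x: (len(cv[x]), sum(impact(x, c) for c in cv[x])))
        if not cv[v]:
            return None  # the bound num_colors was too small for this order
        # Among the colors still allowed for v, take the least-impact one.
        c = min(sorted(cv[v]), key=lambda col: impact(v, col))
        coloring[v] = c
        for u in adj[v]:
            if u not in coloring:
                cv[u].discard(c)  # simplified color reduction
    return coloring
```

For a triangle with one pendant vertex, `greedy_color({0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}, 3)` returns a proper coloring with at most three colors.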


3.3  Generic  Pre-Processing/Dynamic  Accessing  Strategy  for  Large  Problems 

Although the MC solution provides an effective means for solving MCS, the $O(m^2n^2)$
space requirement for storing the adjacency matrix that holds the connection information is
simply too large for applications like the protein common tertiary substructure
problem. Discarding the adjacency matrix and re-computing the adjacency between
vertices could alleviate the space consumption; however, it still requires $O(m^2n^2)$
computation time to visit all the possible connections, and each visit now takes more
time since the adjacency needs to be recalculated.

A generic pre-processing/dynamic accessing technique is developed in this work to
handle such situations. For convenience of discussion, it is described for the
protein substructure problem, but it can easily be extended to a more general context.

Assume that $P_1$ of size m and $P_2$ of size n are the two proteins to be explored. The
dominant subroutine, which needs to be performed repeatedly during the computation, is
finding the adjacent vertex pairs. To be more specific, given that $v_1$ in $P_1$ and $v_2$ in
$P_2$ are paired (matched), it is crucial to determine which vertex pairs in the
correspondence graph are compatible with vertex pair $(v_1, v_2)$. Namely, to find all
vertex pairs $(u_i, u_j)$, $u_i \in P_1$, $u_j \in P_2$, such that

$$|d(v_1, u_i) - d(v_2, u_j)| \le d_{threshold}.$$

The complexity for a specific vertex pair alone is $O(mn)$, and grows to $O(m^2n^2)$ if the
connections for all vertex pairs need to be re-computed. When the problem is small
enough to fit in the primary memory space, this can be done by a simple table look-up
in the adjacency matrix of the correspondence graph. However, the space
consumption is too expensive for the protein common tertiary substructure problem.

Instead of searching through the mn vertex pairs repeatedly, the complexity can be
reduced by pre-sorting, for each vertex in $P_2$, all other vertices in $P_2$ in ascending
order of edge label (distance). The sorting takes $O(n^2 \log n)$ time and needs to be
done only once, at the cost of $O(n^2)$ space in addition to the storage for the protein
itself, rather than the expensive $O(m^2n^2)$. Each time the adjacent vertex pairs for a
vertex pair $(v_1, v_2)$ are needed, they can be found dynamically with the following
algorithm:

Algorithm DynamicFindAdjacentPairs (input: v1, v2)

1. Set L = ∅, S = sorted_list(v2)

2. For each vertex v (v ≠ v1) in P1,

   d = distance(v1, v), l = d − d_threshold, u = d + d_threshold
   (s, e) = RangeSearch(l, u, S)

   For i = s to e, L = L ∪ {(v, index(v2, i))}

3. Return L
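
The pre-sorting and range search can be sketched in Python as follows. This is an illustration rather than the original C implementation: it assumes that each protein is given as a list of 3-D coordinates, that vertices are identified by list index, and that the range search is a binary search (`bisect`) over the pre-sorted distance list; the names `presort` and `find_adjacent_pairs` are hypothetical.

```python
import bisect
import math

def presort(coords):
    """For every vertex j, store the distances to all other vertices in
    ascending order, together with the matching vertex indices."""
    table = []
    for j, p in enumerate(coords):
        row = sorted((math.dist(p, q), k) for k, q in enumerate(coords) if k != j)
        table.append(([d for d, _ in row], [k for _, k in row]))
    return table

def find_adjacent_pairs(p1, sorted_p2, v1, v2, d_threshold):
    """Return all pairs (u_i, u_j) with |d(v1,u_i) - d(v2,u_j)| <= d_threshold."""
    dists, idxs = sorted_p2[v2]
    pairs = []
    for ui, p in enumerate(p1):
        if ui == v1:
            continue
        d = math.dist(p1[v1], p)
        # Range search: positions whose distance lies in [d - t, d + t].
        s = bisect.bisect_left(dists, d - d_threshold)
        e = bisect.bisect_right(dists, d + d_threshold)
        pairs.extend((ui, idxs[i]) for i in range(s, e))
    return pairs
```

The table built by `presort` is computed once for the second protein; each later query then costs one binary search per vertex of the first protein plus the size of the answer.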


Fig. 1. (a) The backbone view of Synaptotagmin (1rsy), and (b) Fibronectin Type III domain
(1fna). (c) The similar structures between (a) 1rsy and (b) 1fna after alignment (71 pairs
matched with r.m.s.d. = 1.742632 Angstrom). Red: similarity from (a) 1rsy; blue: similarity
from (b) 1fna. (d) The backbone view of Hen egg-white lysozyme (1lyz), and (e) T4 phage
lysozyme (2lzm). (f) The similar structures between (d) 1lyz and (e) 2lzm after
alignment (106 pairs matched with r.m.s.d. = 3.923571 Angstrom). Red: similarity from (d)
1lyz; blue: similarity from (e) 2lzm.

4. Experiments

A set of protein files [6] frequently referenced in the molecular biology literature is
selected to test the CRP-MCS algorithm. The protein sizes range from 108 to 497 atoms.
They are Hen egg-white lysozyme (1lyz), T4 phage lysozyme (2lzm), actinoxanthin (1acx),
superoxide dismutase (1cob), tumor necrosis factor (1tnf), methylamine dehydrogenase
(2mad), defensin (1dfn), neurotoxin (1sh1), fibronectin cell-adhesion module type III-10
(1fna), and synaptotagmin I (1rsy). The program is implemented in C and run on
an SGI Onyx (R10000) machine using a single processor. Each experiment on a given
pair of proteins is given a 10-minute time limit, and the best result produced is used to
align the two proteins and derive the error measurement.


Fig. 3. Comparison of the protein common tertiary substructure
discovery between using DALI and the proposed CRP-MCS. Each
circled area represents one of the six experiments. (Plot title: Protein Common
Tertiary Substructure Discovery Using CRP-MCS and DALI; horizontal axis: No. of
Matched Pairs.)


The solution quality is measured with (a) the number (N) of matched Cα atom pairs,
and (b) the root-mean-square deviation (r.m.s.d.), which is defined as

$$\mathrm{r.m.s.d.} = \sqrt{\frac{1}{N}\sum_{i=1}^{N} d_i^2},$$

where $d_i$ is the distance between the i-th pair of Cα atoms. The results are compared
against those from the DALI web server (http://www2.embl-ebi.ac.uk/dali/), which is
an implementation of Holm and Sander’s work [4]. The time cut-off of 10 minutes
for structures with a homolog in the representative set seems to be sufficient.
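
As a small worked illustration, the r.m.s.d. of a set of matched atom pairs can be computed as below. The sketch assumes the standard definition (square root of the mean squared pair distance) and assumes that the two structures have already been superimposed; the function name `rmsd` is hypothetical.

```python
import math

def rmsd(pairs):
    """pairs: list of (a, b) coordinate tuples for the matched atoms,
    taken after the two structures have been aligned."""
    squared = [math.dist(a, b) ** 2 for a, b in pairs]
    return math.sqrt(sum(squared) / len(squared))
```

Two pairs at distances 5 and 0, for instance, give an r.m.s.d. of sqrt(12.5).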

Two typical alignment results are shown in Figure 2, where the optimally aligned
proteins using the discovered largest common structures are shown. The number of
matched atom pairs and the r.m.s.d. values are also included.

In total, six experiments were conducted. They are (1) 1dfn vs. 1sh1, (2) 1fna
vs. 1rsy, (3) 1acx vs. 1cob, (4) 1acx vs. 1tnf, (5) 1acx vs. 2mad, and (6) 1lyz vs. 2lzm.
The results obtained by submitting to the DALI server and by the CRP-MCS algorithm
are plotted in Figure 3, where each experiment is shown in a circled area. Since the
DALI server usually provides only one solution for each submission, the comparison
is made by setting the corresponding threshold parameters in the CRP-MCS solution such
that either the number of matched pairs or the r.m.s.d. is as close to the DALI one as
possible. As shown in the comparison chart, the CRP-MCS algorithm resulted in
better solutions in all six experiments. Analysis of the corresponding pair
information shows that the solutions contain fragments that cannot be detected by
fixed-length approaches. Therefore, this tool is expected to provide more flexibility
and optimality for analyzing protein structures.

5.  Discussions 

The maximal common sub-graph approach has been of great interest to many areas due to
its ability to extract the largest common substructure, the ease of mapping the problem
and the flexibility of incorporating various constraints. However, it has been limited
to small-size problems due to the lack of algorithms that are efficient in both space and
time complexity. This algorithm, CRP-MCS, solves the MCS problem in the maximum
clique domain using MC and GC as complementary solvers. Although the problem is
NP-hard, it is shown to reach near-optimal solutions for a spectrum of benchmark
cases [7] in reasonable time. Moreover, a generic pre-processing and dynamic-accessing
technique is developed to circumvent the space/time overhead. The result
is shown to be a new and effective data-mining tool for discovering, for the first time,
the large common substructures between proteins at the pure tertiary level.

The tool allows fully free matching among the atoms for the protein problem.
As demonstrated in the experiments, it has the capability to find near-optimal
common substructures that cannot be detected by conventional techniques
that use pre-defined patterns or segments. Thus it gives molecular biologists
more flexibility to discover common tertiary substructures.

Structural similarity is an important yet difficult data-mining problem. The CRP-MCS
algorithm presented here is shown to successfully bring the graph-based
approach to an important real-world problem and is expected to have more applications
in assisting the discovery of new knowledge in other related areas as well [7].
Abstract. Concept lattice is an efficient tool for data analysis. In this
paper we show how classification and association rule mining can be
unified under the concept lattice framework. We present a fast algorithm to
extract association and classification rules from a concept lattice.

1.  Introduction 

Concept lattice, also called Galois lattice, was first proposed by Wille [1]. A node of a
concept lattice is a formal concept, consisting of two parts: the extension (examples
the concept covers) and the intension (descriptions of the concept). A concept lattice gives
a vivid and concise account of the relations (generalization/specialization) among those
concepts through its Hasse diagram.

Classification rule mining and association rule mining are two important data
mining techniques. There are already some classification systems based on concept
lattice. Empirical evaluation shows that concept lattice based systems have
performance comparable with typical systems such as C4.5 [5]. Association rule
mining has recently been a hot research topic in data mining. Some authors have shown that
concept lattice is a natural framework for association rule mining [4]. In this paper we
show that concept lattice is an appropriate tool for integrating association and
classification rule mining. Other authors have also discussed the topic [2], but we argue that
concept lattice embodies the relationships between concepts in a more understandable
way. Therefore it is of interest to deal with the task in the context of concept
lattice.

2  Basic  Notions  of  Concept  Lattice 

In this section we briefly recall the necessary basic notions of concept lattice; a detailed
description can be found in [1].

Given a context (O, D, R) describing a set O of objects, a set D of
descriptors and a binary relation R, there is a unique corresponding lattice structure,
which is known as the concept lattice. Each node in lattice L is a pair, denoted (X, Y),
where $X \in P(O)$ is called the extension of the concept and $Y \in P(D)$ is called the
intension of the concept. Each pair must be complete with respect to R, i.e.:

(1) $X = \{x \in O \mid \forall y \in Y,\ xRy\}$; (2) $Y = \{y \in D \mid \forall x \in X,\ xRy\}$.

A partial order relation can be built on all concept lattice nodes. Given $H_1 = (X_1, Y_1)$
and $H_2 = (X_2, Y_2)$, let $H_1 < H_2 \Leftrightarrow Y_1 \subset Y_2$; the precedence order means that $H_1$ is a direct
parent of $H_2$. The Hasse diagram of the lattice can be generated using the partial order
relation: if $H_1 < H_2$ and there is no other node $H_3$ such that $H_1 < H_3 < H_2$, there is an edge
from $H_1$ to $H_2$.

Below  is  an  example  of  context  and  corresponding  lattice  and  Hasse  diagram. 


In our algorithm a node in the lattice is denoted by (C = |X|, Y), as the content of X does
not matter. It is now easy to see that if C is bigger than some threshold t, Y is a maximal
large item set.

Implication rules can be derived from the concept lattice. Rule $Q \Rightarrow R$ holds if and
only if the smallest concept (intent) containing Q also contains R [6].
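
The implication test can be made concrete with a short Python sketch. It assumes the context is stored as a dictionary from each object to its set of descriptors; the smallest intent containing Q is then obtained by taking the objects that have all of Q and intersecting their descriptor sets. The function names and the toy context in the usage note are illustrative only.

```python
def extent(context, items):
    """Objects that have every descriptor in `items`."""
    return {o for o, descs in context.items() if items <= descs}

def intent(context, objects):
    """Descriptors shared by every object in `objects`."""
    all_descs = set().union(*context.values())
    return {d for d in all_descs if all(d in context[o] for o in objects)}

def implies(context, q, r):
    """Q => R holds iff the smallest intent containing Q also contains R."""
    closure = intent(context, extent(context, q))
    return r <= closure
```

With the toy context `{"o1": {"a", "b"}, "o2": {"a", "b", "c"}, "o3": {"c"}}`, the rule {a} ⇒ {b} holds, while {b} ⇒ {c} does not.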

3  Building  the  Lattice 

In order to reduce the number of nodes in the lattice, it is necessary to introduce a support
threshold. We adapt Bordat’s algorithm [3] by introducing a support threshold ε and
making other minor improvements. Because lattice-constructing algorithms find
only maximal itemsets, they are much faster than the Apriori algorithm.

The lattice L is initialized with the topmost node (|O|, ∅) and expanded by
constructing its subnodes recursively. In the algorithm we use an array of pointers PX to
keep track of the first appearance of every single attr-val pair in the lattice. This structure
will be used later in the association rule mining. Once the support of a node is found to be
lower than ε, the node will no longer be expanded. We improved the original
algorithm by utilizing counting information. That is, instead of checking whether the
extensions of two attr-val pair sets are identical, we check whether the count of either
extension is equal to the count of the extension of the union of the two attr-val pair sets.
Experiments show that it is about five times faster than the original algorithm (when ε is
set to 0). The lattice built by this algorithm is in fact a “frequent” lattice, i.e. it
contains only those nodes whose support is greater than ε. Thus the algorithm reduces
the complexity of building the complete lattice.
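
The counting shortcut rests on the fact that the extension of the union of two attr-val pair sets is always contained in the extension of either set, so equal counts imply equal extensions. The Python sketch below illustrates this under the assumption that each object is stored as a set of attr-val pairs; the function names are hypothetical, and the set-based version is included only as a reference to compare against.

```python
def support_count(rows, items):
    """Number of objects (rows) that contain every attr-val pair in `items`."""
    return sum(1 for row in rows if items <= row)

def covered_by_count(rows, a, b):
    """Count-based test: the extension of `a` equals that of `a | b`,
    i.e. every object holding `a` also holds `b`."""
    return support_count(rows, a) == support_count(rows, a | b)

def covered_by_sets(rows, a, b):
    """Reference test that materialises both extensions as index sets."""
    ext_a = {i for i, row in enumerate(rows) if a <= row}
    ext_ab = {i for i, row in enumerate(rows) if (a | b) <= row}
    return ext_a == ext_ab
```

Both functions give the same answer, but the count-based one needs only two integers per comparison instead of two object sets.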
4  Rule extraction from the lattice

In this section we present an algorithm which generates all non-redundant rules for a
given item set (set of attr-val pairs) as right hand side (RHS). We first find the smallest
node containing the item set, then launch a breadth-first traversal of its sub-lattice.
From each node we generate all non-redundant rules. Because of the way we build the
lattice, the support of every rule generated is greater than the minimal support. We first
produce all rules whose confidence is 1 (i.e. implication rules) and then the appropriate
lower-confidence rules. The way we generate (implication) rules relies on the following
observations.

Observation 1 If a node H=(C, X) has only one parent node P=(C’, X’), then

(1) The left hand side (LHS) of rules generated from H consists of a single item.

(2) For each attr-val $p \in \{X - X'\}$, there is a rule $p \Rightarrow X - p$.

Suppose the LHS of a rule generated from H consists of more than one item
(attr-val). If all these items are also included in X’, they must have been already treated
in parent P because of our top-down traversal; if there exists an item $p \in \{X - X'\}$
in the LHS, the rule is redundant with respect to $p \Rightarrow X - p$, since the latter is simpler.

Observation 2 If a node H=(C, X) has d parent nodes $P_1(C_1, X_1), P_2(C_2, X_2), \ldots, P_d(C_d, X_d)$,
there is a rule $p \Rightarrow X - p$ for each item $p \in \{X - (X_1 \cup X_2 \cup \ldots \cup X_d)\}$.

Because any item $p \in \{X - (X_1 \cup X_2 \cup \ldots \cup X_d)\}$ appears in the lattice for the
first time, it is obvious that its confidence is 1. Any rule whose LHS strictly includes p is
redundant with respect to $p \Rightarrow X - p$.

Observation 3 If a node H=(C, X) has two parent nodes $P_1(C_1, X_1)$, $P_2(C_2, X_2)$, then
$\forall p_1 \in \{X_1 - X_1 \cap X_2\}$ and $\forall p_2 \in \{X_2 - X_1 \cap X_2\}$, there is a rule $p_1p_2 \Rightarrow X - p_1p_2$.

That is because if two items come from the same parent, their
relationship must have been described before. So any rule whose LHS strictly
includes $p_1p_2$ would be redundant with respect to $p_1p_2 \Rightarrow X - p_1p_2$.

Observation 3 can be generalized to the case of any number of parent nodes.

In the algorithm, we adopt a heuristic search strategy. If an item set cannot form
an implication rule, it is saved in a candidate set. In the next loop, all items in the
candidate set are joined in an Apriori-like manner. Then the new candidates are tested
to see whether they can form an implication rule.

As to rules whose confidences are below 1, we use a data structure PX (see
previous section) to aid in computing confidence. PX points to the first appearance of every
single element of the LHS. Function PX(lhs) finds the first appearance of the LHS and thus
its support. If the LHS of such a rule is included in another rule, it will be discarded since
rules with a longer LHS have higher confidence.

The computation depends heavily on judging whether several elements are
included in a common parent. We introduce a bit vector V to make this judgement
efficiently. Every element in the node has a bit vector. If the element also appears in a
parent, the corresponding bit will be set. Thus any combination of those elements can
be judged by simple and fast AND operations.
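
A possible realization of the bit-vector test is sketched below in Python, using plain integers as bit vectors. It assumes that bit i of an element's vector is set exactly when parent i contains that element; the function names are hypothetical and the paper's actual data layout may differ.

```python
def parent_masks(node_items, parent_intents):
    """Map each item of the node to an integer whose bit i is set
    iff parent i also contains the item."""
    masks = {}
    for item in node_items:
        mask = 0
        for i, parent in enumerate(parent_intents):
            if item in parent:
                mask |= 1 << i
        masks[item] = mask
    return masks

def share_parent(masks, items):
    """True iff at least one parent contains every item in `items`."""
    acc = -1  # all bits set, the neutral element of AND
    for item in items:
        acc &= masks[item]
    return acc != 0
```

For a node with items {a, b, c, d} and parents {a, b} and {b, c}, the pair (a, b) shares a parent, while (a, c) does not, so only the latter is a candidate LHS under Observation 3.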

Rule Extraction Algorithm for specific RHS

1.  GenRule(itemset rhs)
2.    Find first node H containing rhs by breadth-first traverse
3.    queue ← H, ruleset ← ∅, singleset ← ∅
4.    while queue not empty
5.      Remove H from queue head, push all children of H into queue tail
6.      if H not visited then { GenRuleFromNode(H); mark H as visited }
7.    endwhile

8.  GenRuleFromNode(H=(C,X))
9.    d ← number of parents of H: (C1,X1)...(Cd,Xd)
10.   if (d == 1) { ruleset = ruleset ∪ {p→rhs | p ∈ {X−X1}}; return; }
11.   for every parent of H compute their union S and generate array V.
12.   ruleset = ruleset ∪ {p→rhs | p ∈ {X−S}};
13.   ruleset = ruleset ∪ {p→rhs, conf=||PX(p)||/||H|| | p ∈ S, not sameparent(p, rhs), p
      not generated before}
14.   L ← {S1∪S2 | S1 ∈ S, S2 ∈ S}
15.   while L not empty
16.     S ← ∅
17.     for every element K in L
18.       if the items in K aren’t all included in the same parent and not ∃R ∈ ruleset that R ⊂ K
19.         ruleset = ruleset ∪ {K→rhs, sup=C/||O||, conf=1}
20.       else
21.         S ← S ∪ K
22.         if not sameparent(K,rhs) then ruleset = ruleset ∪ {K→rhs, conf=||PX(K)||/||H||}
23.       endif
24.     endfor
25.     L ← {S1∪S2 | S1 ∈ S, S2 ∈ S, ||S1∪S2|| = ||S1||+1}
26.   endwhile
27.   return ruleset;

Lines 14-27 generate rules whose LHS contains more than one item in an
Apriori-like manner. The algorithm is written according to the above observations. The
rules generated are sorted by confidence (larger to smaller). If the confidence is the same,
the rule with higher support comes first. Classification is done by matching the new instance
against every rule from beginning to end. If no rule fires, then the majority class is used.
When building the lattice, class attribute values are treated as ordinary attributes
and are added to the lattice. The rule extraction algorithm is run a number of times,
taking each class attribute value in turn as the parameter value. Then all rules are collected
together to perform the classification.

5  Experiments  and  Conclusions 

We implemented our algorithm and did the comparison using MLC++. First we did some
preliminary tests on the lattice-constructing algorithm and found it much faster than
Apriori. This is because the algorithm produces only maximal large item sets. Thus
the comparison between them doesn't seem to have much meaning.

In this section we mainly present the result of comparing C4.5 and our algorithm.
We use 10 datasets from the UCI Repository for the comparison. In our experiment,
minimum confidence is set to 0.6 and minimum support is set to 0.01. Our experiment
is done on a 64 MB PII 233 PC running Windows 98. Our algorithm is referred to as
CLACF (Concept Lattice based Association and Classification rules mining
Framework). The discretization is done using the entropy method in MLC++.

Dataset     C4.5    CLACF   Rule time   No. of rules   Lattice time
Breast       5.0      3.8      20.1          307
Diabetes    25.8     27.8      2.14           35
Glass       31.3     18.9      0.88           49
Heart       19.2     16.6      3.51          528          11.1
Iris         4.7      4.0      0.0            19           0.0
Led7        26.5     23.7      0.83          278           8.0
Monk1       19.0      9.0      0.22          127
Monk2       30.1     19.9      0.38          169
Monk3        8.3      5.1      0.28          113
Pima        24.5     27.1      1.0            25           0.4
Average     18.5     15.6      2.93          165           5.43

Columns 2 and 3 show the error rates of C4.5 and CLACF. CLACF
outperforms C4.5 in 8 out of 10 datasets, and has an average error rate of 15.6, which
is lower than the 18.5 of C4.5. Columns 4 to 6 give the execution times of the two
algorithms and the number of rules generated. We can see that the algorithm
produces a relatively smaller set of rules compared with [2] while retaining accuracy.

In this paper we propose a framework to integrate classification and association
rule mining based on concept lattice. We adapt an existing lattice-constructing
algorithm to generate a ‘frequent’ lattice and present an efficient algorithm to
produce association/classification rules from the lattice. In our future work, we will
focus on developing faster algorithms by further exploring the relationships stored in
the concept lattice.
Abstract. We consider the problem of discovering the conceptual clusters from
a large database. From Z. Pawlak’s information system in rough set theory, we
define an information matrix, information mappings and some concepts in data
mining literature such as large sets, association rules and conceptual clusters. We
propose a combined method of information matrix, Kohonen’s neural network
for large set discovery and genetic algorithm for conceptual cluster validity. We
present an application of our method to a student database for discovering the
rules contributing to the training of the gifted students.


1  Introduction 

Data Mining (DM) aims to discover the interesting patterns present implicitly in large
databases [7]. In this paper, we study the problem of conceptual cluster discovery from
a large database. This problem is stated as: given a set of objects, conceptual
clustering discovery is to find clusters of objects based on a conceptual closeness
among objects [1],[2],[3],[4]. We propose a method for solving and expanding this
problem. Based on Z. Pawlak’s information system [9], we define an information
matrix and some concepts; then we employ a combination of Kohonen’s self-organizing
algorithm (SOA) and a genetic algorithm (GA) for conceptual cluster discovery and for
building rules from these discovered concepts. We build an information matrix in the
computer memory to improve the speed of the mining process. The paper is organized as
follows. Section 1: Introduction. Section 2: Formal definitions. Section 3: Problem
statement. Section 4: Using SOA for discovering large descriptor sets. Section 5:
Using GA for cluster validity. Section 6: An application to a student database. Section
7: Conclusions and future works.


2  Formal  definitions 

In this section, we define an information matrix and some concepts related to our
proposed method. Based on these definitions, we implement a set of functions for
processing the mining tasks in the computer memory instead of scanning the whole
database on disk. Therefore, we can significantly improve the speed of the mining
process.

2.1  Definition  1:  Information  matrix 

An information matrix is defined as B=(O,D) where $O=\{o_1,\ldots,o_n\}$ is a finite set of n
objects and $D=\{d_1,\ldots,d_m\}$ is a finite set of m descriptors. Let $b_{ij}$ (i=1,...,n and
j=1,...,m) be the element of matrix B; $b_{ij}=1$ if $o_i$ has $d_j$, otherwise $b_{ij}=0$.


2.2  Definition  2:  Information  mappings 

Given a finite set O of n objects and a finite set D of m descriptors [5]. Let P(D) be the
power set of D and P(O) be the power set of O. Information mapping χ is defined as:

$\chi: O \times D \to \{0,1\}$. Given $o \in O$ and $d \in D$, $\chi(o,d) = 1$ if o has d, otherwise $\chi(o,d) = 0$.
Mappings ρ and λ are defined as $\rho: P(D) \to P(O)$ and $\lambda: P(O) \to P(D)$ where:

Given $S \subset D$ then $\rho(S) = \{o \in O: \forall d \in S,\ \chi(o,d) = 1\}$

Given $X \subset O$ then $\lambda(X) = \{d \in D: \forall o \in X,\ \chi(o,d) = 1\}$


2.3  Definition  3:  Large  descriptor  set 

Given an information matrix B=(O,D) and a threshold τ, which is the MINSUP of the
large item set in data mining literature [7]. A large descriptor set S is a subset of D that
satisfies the condition $Card(\rho(S))/Card(O) \ge \tau$, where Card is the cardinality of a set.


2.4  Definition  4:  Binary  association  rule 

Given an information matrix B=(O,D) and a threshold τ. Let S be a large descriptor
set of B. Let $L_i$, $L_j$ be subsets of S. A binary association rule with threshold τ is a
mapping from $L_i$ to $L_j$ and is denoted as $L_i \to L_j$.

2.5  Definition  5:  Confidence  factor  of  a  binary  association  rule 

Let S be a large descriptor set of B, let $L_i$, $L_j$ be subsets of S, and let $L_i \to L_j$ be a binary
association rule with a threshold τ. The confidence factor $CF(L_i \to L_j)$ of this rule is
calculated by $Card(\rho(L_i) \cap \rho(L_j)) / Card(\rho(L_i))$.
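
Definitions 2 to 5 translate directly into a few lines of Python. The sketch below is an illustration, not the authors' implementation: it assumes the information matrix is held as a list of 0/1 rows (the values are those read from Table 1 in Section 4), identifies objects and descriptors by 0-based index, and uses hypothetical function names.

```python
# Rows are objects o1..o6, columns are descriptors d1..d6 (0-based in code).
B = [
    [1, 1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [1, 1, 1, 1, 0, 0],
    [0, 0, 1, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
]

def rho(S):
    """Objects that have every descriptor in S."""
    return {i for i, row in enumerate(B) if all(row[j] for j in S)}

def lam(X):
    """Descriptors shared by every object in X."""
    return {j for j in range(len(B[0])) if all(B[i][j] for i in X)}

def is_large(S, tau):
    """Large descriptor set test: Card(rho(S)) / Card(O) >= tau."""
    return len(rho(S)) / len(B) >= tau

def confidence(Li, Lj):
    """CF(Li -> Lj) = Card(rho(Li) & rho(Lj)) / Card(rho(Li))."""
    base = rho(Li)
    if not base:
        raise ValueError("antecedent matches no object")
    return len(base & rho(Lj)) / len(base)
```

For example, `rho({0, 1, 2})` gives the first three objects, so {d1, d2, d3} is large at τ = 50%, and `confidence({2}, {0})` evaluates the rule {d3} → {d1} to 0.75.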


2.6  Definition  6:  Concept 

A concept is a pair C=(X,S) where $X \subset O$ and $S \subseteq D$. X and S satisfy the following
conditions:

a) $X \subset \rho(S)$ and $\lambda(X) = S$

b) $\forall L_i, L_j \subset S$ with $Card(L_i) = Card(L_j) = 1$, $\rho(L_i) \subset \rho(L_j)$.
3  Problem statement

Problem 1: Given an information matrix B and a threshold τ, find all large descriptor
sets of B. The large descriptor set determines the popular descriptors of data objects.
The threshold τ determines a measure of popularity [7].

Problem 2: Given an information matrix B and a threshold τ, find k conceptual
clusters $C_1,\ldots,C_k$ where $C_i = (X_i, S_i)$. These conceptual clusters satisfy: a) $\bigcap X_i = \emptyset$
for i=1,...,k; b) $\bigcap S_i = \emptyset$ for i=1,...,k; c) $Card(X_i)/Card(O) \ge \tau$; d) maximize the
ratio $Card(X_1 \cup \ldots \cup X_k)/Card(O)$; e) $C_i$ is a concept. A conceptual cluster determines an
object set that has the same set of descriptors. Based on the concept C=(X,S), we
build rule $L_i \to L_j$ where $L_i \cup L_j = S$ and $L_i \cap L_j = \emptyset$. It means that if an object has all the
descriptors of $L_i$ (rule antecedent) then the object has all the descriptors of $L_j$ (rule
consequent).


4  Using  SOA  for  discovering  large  descriptor  sets 

In this section, we employ SOA for discovering the potential large descriptor sets [6].
SOA can be summarized as follows:

Step 1. Initialize all weight vectors of Kohonen’s neural network.
Step 2. Select the node with minimum distance $d_v$ to the input vector v(t).
Step 3. Update the weight vectors of nodes that lie within a nearest neighbor set
of the node $(i_c, j_c)$: $w_{ij}(t+1) = w_{ij}(t) + \alpha(t)(v(t) - w_{ij}(t))$
for $i_c - N_c(t) \le i \le i_c + N_c(t)$ and $j_c - N_c(t) \le j \le j_c + N_c(t)$.
Step 4. Update time t = t+1, add a new input vector and go to Step 2.

In the above algorithm, $d_v$ is the Euclidean distance, α(t) is a gain ratio ($0 \le \alpha(t) \le 1$) and
$N_c(t)$ is the radius of the neighbor set. $N_c(t)$ and α(t) decrease monotonically with
time. The algorithm finishes when α(t)=0 or $N_c(t)$=0.
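
Steps 1 to 4 can be sketched in plain Python as follows. This is a minimal illustration under several assumptions that are not taken from the paper: the gain and the radius decay linearly, training runs for a fixed number of steps rather than until α(t) or $N_c(t)$ reaches zero, inputs are presented cyclically, and the function name and default parameters are hypothetical.

```python
import random

def train_som(inputs, rows, cols, steps, alpha0=0.5, radius0=None, seed=0):
    """Train a rows x cols Kohonen map on a list of equal-length vectors."""
    rng = random.Random(seed)
    dim = len(inputs[0])
    if radius0 is None:
        radius0 = max(rows, cols) // 2
    # Step 1: random initial weight vectors.
    w = [[[rng.random() for _ in range(dim)] for _ in range(cols)]
         for _ in range(rows)]
    for t in range(steps):
        v = inputs[t % len(inputs)]          # Step 4: next input vector
        frac = 1 - t / steps
        alpha = alpha0 * frac                # gain decreases with time
        radius = int(radius0 * frac)         # neighbourhood shrinks with time
        # Step 2: node with minimum (squared) Euclidean distance to v.
        ic, jc = min(((i, j) for i in range(rows) for j in range(cols)),
                     key=lambda ij: sum((a - b) ** 2
                                        for a, b in zip(v, w[ij[0]][ij[1]])))
        # Step 3: move every node in the square neighbourhood towards v.
        for i in range(max(0, ic - radius), min(rows, ic + radius + 1)):
            for j in range(max(0, jc - radius), min(cols, jc + radius + 1)):
                w[i][j] = [wk + alpha * (vk - wk) for wk, vk in zip(w[i][j], v)]
    return w
```

Each row of the information matrix would be passed as one input vector; after training, rows that share many descriptors tend to be mapped to the same or neighboring nodes.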

Given the information matrix in Table 1, each row of this matrix corresponds to an
input vector of Kohonen’s neural network.


Table 1. An information matrix for large descriptor set discovery.

      d1   d2   d3   d4   d5   d6
o1     1    1    1    0    0    0
o2     1    1    1    0    0    0
o3     1    1    1    1    0    0
o4     0    0    1    1    1    1
o5     0    0    0    1    1    1
o6     0    0    0    1    1    1

After running SOA, we have the potential large descriptor sets:
{d1, d2, d3}, {d4, d5, d6}, {d1, d2, d3, d4}. With τ=50%, {d1, d2, d3} and {d4, d5, d6} are
large descriptor sets; {d1, d2, d3, d4} is not a large descriptor set because
Card(ρ({d1, d2, d3, d4}))/Card(O) = 33.3% < τ.


5  Using  GA  for  cluster  validity 

Large descriptor sets discovered by SOA are used for building the initial GA
population. We hold that a subset of a large descriptor set is also a large descriptor
set [7]. Let $L=\{L_1,\ldots,L_k\}$ be a set of k large descriptor sets; we employ GA [8] for
finding a set $\{S_1,\ldots,S_k\}$ where $S_i \subset L_i$ (i=1,...,k) and $(S_i, \rho(S_i))$ is a concept. A
chromosome is a set of $BS_i$; each $BS_i$ is a bit string corresponding to a large descriptor
set. With two large descriptor sets {d1, d2, d3} and {d4, d5, d6}, we have the chromosome
{d1:1, d2:1, d3:1, d4:1, d5:0, d6:1}. The genetic representation of population P is a set
of chromosomes. A typical population P with 3 chromosomes is as follows:

P(t) = {111111, 100111, 001100}. The genetic operations are defined as:


5.1  Crossover  operator 

Given two parental chromosomes {a1, a2, a3, a4, a5, a6} and {b1, b2, b3, b4, b5, b6}
where $a_i, b_i \in \{0,1\}$ (i=1,...,6). The crossover will swap a portion of the two parental
chromosomes and yield the offspring {a1, a2, a3, b4, b5, b6} and {b1, b2, b3, a4, a5, a6}.


5.2  Mutation  operator 

Given a chromosome {a1, a2, a3, a4, a5, a6}. Select a random position $h \in [1..6]$. Let h
be the selected position; if $a_h = 1$ then $a_h$ is changed to 0, and vice versa.
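
The two operators can be sketched in Python on chromosomes written as bit strings, as in the population example of this section. The crossover point is passed explicitly, positions are 0-based in the code, and the function names are illustrative.

```python
import random

def crossover(a, b, point):
    """Swap the tails of two equal-length bit strings after `point`."""
    return a[:point] + b[point:], b[:point] + a[point:]

def mutate(chrom, h):
    """Flip the bit at 0-based position h."""
    flipped = "0" if chrom[h] == "1" else "1"
    return chrom[:h] + flipped + chrom[h + 1:]

def random_mutate(chrom, rng=random):
    """Flip one uniformly chosen position of the chromosome."""
    return mutate(chrom, rng.randrange(len(chrom)))
```

For instance, `crossover("111111", "001100", 3)` yields `("111100", "001111")`, and `mutate("001100", 0)` yields `"101100"`.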


5.3  Fitness  factor  and  fitness  value 

Fitness factor of $S_{ij}$: Let $S_{ij}$ be a subset of chromosome $BS_i$; we build set Q containing
all two-element subsets of $S_{ij}$. Let {a, b} be an element of Q. From {a, b}, we build
two rules {a} → {b} and {b} → {a} and calculate the CFs of these rules. The fitness
factor of $S_{ij}$ is the average of the CFs of the 2×Card(Q) rules which are built up from Q.
The fitness value of a chromosome $BS_i$ is the average of the fitness factors of all $S_{ij}$ in
chromosome $BS_i$.
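
The fitness computation can be illustrated with the Python sketch below. It assumes that ρ has been precomputed for every single descriptor as a set of object identifiers, that each subset has at least two descriptors, and that every descriptor occurs in at least one object; the names `fitness_factor` and `fitness_value` are hypothetical.

```python
from itertools import combinations

def fitness_factor(rho, s):
    """Average CF of the rules {a}->{b} and {b}->{a} over all
    two-element subsets {a, b} of the descriptor set s.

    rho: dict mapping each descriptor to the set of objects having it.
    """
    cfs = []
    for a, b in combinations(sorted(s), 2):
        both = len(rho[a] & rho[b])
        cfs.append(both / len(rho[a]))  # CF({a} -> {b})
        cfs.append(both / len(rho[b]))  # CF({b} -> {a})
    return sum(cfs) / len(cfs)

def fitness_value(rho, subsets):
    """Average fitness factor over the descriptor subsets of a chromosome."""
    return sum(fitness_factor(rho, s) for s in subsets) / len(subsets)
```

With the matrix of Table 1, the subset {d1, d2, d3} produces six rules, four with CF 1 and two with CF 0.75, so its fitness factor is 5.5/6.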


6  An  application  to  a  student  database 

We employ our proposed method for discovering the conceptual clusters from a
student database. An information matrix with 1000 rows and 100 columns is built up
from this database. In this matrix, each row corresponds to a record and each column
corresponds to a descriptor. Some descriptors of the information matrix are “parents of
student are teachers”; “student is ranked in good level of learning”; “student wins a
prize of computer science competition”.

The size of Kohonen’s output layer is 100x100. With the threshold t = 0.7 (70%), we
discover some large descriptor sets such as {student wins a prize of a math competition;
student is interested in math}; {student is ranked in good level of learning; parents of
student are teachers}; {student is interested in math; student is interested in foreign
language; student is interested in computer science}.

We employ the following values for the GA parameters: number of chromosomes is 50;
number of generations is 300; crossover probability is 0.1; mutation probability is 0.1.
The GA gives us some discovered conceptual clusters, such as {student is ranked in good
level of learning; student has good behavior; parents of student are teachers; student
has a self-learning time greater than 6 hours every day}; {student is interested in
math; student is interested in foreign language; student is interested in computer
science}; {student lives in the country; income of student's family is lower than $100
every month; student is ranked in fair level of learning}.


7  Conclusions  and  future  works 

We gathered some preliminary results in using a combination of an information matrix,
GA and SOA for cluster discovery in data mining. The experiment shows encouraging
results on a large data set. A matrix expressed in bits is also used for keeping the whole
information matrix in main memory to increase the efficiency of conceptual cluster
discovery. We continue to study how to change the binary information matrix into a fuzzy
information matrix and use fuzzy cluster discovery for fuzzy databases.
Abstract. We propose a knowledge discovery process for multi-factor
portfolio management on a financial decision support system. We first
construct an OPen Intelligent Computing System (OPICS) to support
time series management and knowledge management. A system, Cyclone,
which efficiently supports financial applications, is developed under
OPICS. We then introduce a data mining solution for equity portfolio
construction using the simulated annealing algorithm. Two data sets
consisting of small stocks, ranging from 11/86 to 10/91 and from 6/93
to 5/96, are used. The corresponding rates of return of the Russell 2000
index are collected as benchmarks for evaluation based on the Sharpe
ratios and the turnover ratios. The results show that the simulated
annealing algorithm outperforms both the market index and the gradient
maximization method.

1  Introduction 

Competition in the investment business is intense and increasing. Historically,
investment managers who actively select stocks employ a team of analysts who
understand various industries, visit companies, and utilize quantitative
techniques to learn important information to help them recommend which stocks to
own. As Internet technology and data availability change rapidly, quantitative
techniques are also becoming more sophisticated. Data warehousing and data mining
techniques are thus being adopted across the financial services as well as the
banking industry.

We propose a knowledge discovery process for multi-factor portfolio management
on a financial decision support system. The process consists of two construction
phases: the knowledge management functionality and the data mining solution.
We first construct an OPen Intelligent Computing System (OPICS) to support time
series management and knowledge management. A system, Cyclone, is developed
under OPICS. Cyclone is designed to allow power users to adjust parameters,
simulate portfolios easily and efficiently, and eventually to share knowledge
with other power users. We then introduce a data mining solution for equity
portfolio construction using the simulated annealing algorithm. Two data sets
consisting of small stocks, ranging from 11/86 to 10/91 and from 6/93 to 5/96,
are used. The corresponding rates of return of the Russell 2000 index are
collected as benchmarks. The evaluation is based on the Sharpe ratios as well as
the turnover ratios. The results show that the simulated annealing algorithm
outperforms the market index as well as the gradient maximization method.

2  A  Motivating  Example 

The process of constructing a portfolio involves defining a universe of stocks,
dealing with data integrity issues, selecting an appropriate construction
technique, determining how to use training and testing data, setting up the
construction rules and constraints, and selecting a publicly available benchmark
for evaluating the test results. The whole process is known as the back-testing
model in the financial investment community. Schock and Brush [1] show an
example of managing a small-stock portfolio that exploits the above process. The
first step is to construct a sequence of monthly universes by ranking a database
on market capitalization, then eliminating the largest 1,000 stocks and using the
next 600, excluding limited partnerships, investment trusts and stocks with very
low trading volumes.

The second step is to add value measures and other proven measures to the
universe. The common factors are earnings to price ratio, book value to price
ratio, cash flow to price ratio, volatility adjusted price momentum, etc. The
rest of the work is to apply a decision model to construct monthly portfolios
through time and compute the corresponding returns. The resulting portfolio
returns are then compared with those of the selected benchmarks. To optimize the
flow of information and “knowledge-worker-to-knowledge-worker” interaction so
that companies can make better trading decisions, the specific data and modeling
results should then be shared and managed. This is the essence of knowledge
management.


3  Knowledge  Management 

Successful knowledge management means that our data processes enhance
the way people work together and enable knowledge workers and partners to share
information easily, so that they can build on each other’s ideas and work more
effectively and efficiently. Though data for financial applications are simple
data, they typically include time series. Empirical research based on time
series is thus a data-intensive activity that needs a knowledge management system
with data and time modeling capabilities, computational intelligence and
performance functionalities (see Schmidt and Marti [2], Dreyer, Dittrich, and
Schmidt [3]).

We develop a time-series management system, OPICS, specifically for security
investment research. OPICS is a component-based system enhanced with
distributed processing capability. To effectively manage the data sets derived
by security firms’ knowledge workers, a time series data set is organized
into two parts, the header and the data. The header is meta-data about the
time series basis and its derived data sets. We also develop a time series
management system (TSMS), Cyclone, under the OPICS platform to enable knowledge
workers in a security research firm to share their ideas and models with others.
See Lu and Cheng [4] and [5] for the inter-modules, the system architecture,
as well as other functionalities.

4  Data  Mining  for  Multi-factor  Portfolio  Construction 

Building a multi-factor excess return model to select portfolio stocks has become
a widely used tool for portfolio management. We consider the portfolio
construction model as a global optimization problem. To control risk throughout
the portfolio construction process, we take the model’s objective function to be
maximizing the return over risk. The optimization algorithms examined in this
paper are the gradient maximization method (see Brush [6]) and the simulated
annealing technique (see Kirkpatrick [7]).

We use the Sharpe excess return (see Sharpe [8]) to capture the concerns
of portfolio managers who seek high returns with low risk. We apply the
simulated annealing algorithm in conjunction with the above integrated
return/risk portfolio model. Since the values of the Sharpe excess return can be
either positive or negative, we write the dynamic rule in a symmetrical form,
$E = 1/(1 + e^{\mathrm{SharpeExcessReturn}})$, which is referred to as an energy
measure. For each given temperature, the thermal equilibrium can be described by
a probability distribution function with respect to the occurrence of a state
with energy $E$, that is,

$$P(E) = \frac{1}{Z(T)} \exp\left(-\frac{E}{K_B T}\right) \qquad (1)$$

where $Z(T)$ is a normalizing constant, $E$ is the energy of the state, $P(E)$ is
the probability of finding a unit in that state, $T$ is the temperature, and $K_B$
is Boltzmann’s constant. Let $\Delta E$ be the change of the energy $E$. The
traditional hill-climbing algorithm only accepts $\Delta E$ when it is less than
zero, while the simulated annealing algorithm allows a positive $\Delta E$ to be
accepted with probability greater than 0. The process of simulated annealing
involves three steps: the generation of a new state, the acceptance criterion of
the new state, and the condition of the cooling schedule, which are briefly
discussed as follows.

Generating a new state for the purpose of portfolio optimization is to explore
the region around the current weighting combination. By sequentially changing
each factor’s weight up and down a bit from the original weighting combination,
local search determines a combination of factor changes that identifies the best
direction for improving the portfolio return. Having established a direction of
local search, we next make a bigger step in the weight changes in the portfolio.
When $\Delta E$ is less than zero, we accept the new state with probability 1. If
$\Delta E$ is greater than zero, then we accept the new state with the probability
$e^{-\Delta E / T}$, where $T$ represents the current temperature.

Table 1. Portfolio construction using data during 11/86 - 10/91

Data Set          Year(s)  Portfolio  Benchmark  Excess  Sharpe  Turnover
(11/86 - 10/91)            Return     Return     Return  Ratio   Ratio
GM                1        -11.31%    -18.04%    6.73%   0.24    8.14%
GM                2        6.46%      3.21%      3.24%   0.18    8.01%
GM                3        11.73%     8.64%      3.09%   0.18    7.48%
GM                4        1.01%      -0.08%     1.69%   0.09    6.76%
GM                5        8.02%      6.97%      1.05%   0.05    6.16%
SA                1        -9.95%     -18.04%    8.09%   0.32    8.97%
SA                2        7.21%      3.21%      4.00%   0.26    9.20%
SA                3        13.06%     8.64%      4.42%   0.23    10.43%
SA                4        2.01%      -0.08%     2.09%   0.11    11.31%
SA                5        8.64%      6.97%      1.67%   0.09    12.06%

The annealing schedule is also critical to the performance of the algorithm.
Without any prior knowledge of energy landscapes, one can only hope to derive
an appropriate cooling schedule for a specific random process. A cooling
schedule for the temperature $T(t)$ must decrease from a given sufficiently high
temperature $T_0$ down to zero. To determine an initial temperature
experimentally, we first let the system run freely with a 100% acceptance
rate for a certain number of iterations. By sampling all energy states, we can
calculate the standard deviation to estimate an initial temperature. We then
implement the cooling schedule using the formula $T(t) = T_0/(1 + t)$, where
$T_0$ is the initial temperature. Equilibrium at a particular temperature is
detected by the formula $|T_{est} - T_c| < 0.01 \cdot T_c$, where $T_{est}$ is
the temperature estimated from N samples and $T_c$ is the current temperature
(see Szu [9]).
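The three steps (new state, acceptance, cooling) can be put together in a short sketch. Several parts are assumptions rather than the authors' implementation: the acceptance probability is taken to be the standard Metropolis form exp(-ΔE/T) with Boltzmann's constant folded into T, the neighbour move perturbs one randomly chosen weight instead of the directed local search described above, and `score` is a hypothetical stand-in for the Sharpe excess return of a weighting combination.

```python
import math
import random


def energy(sharpe_excess_return):
    """Energy E = 1 / (1 + exp(S)); a higher Sharpe value gives lower energy."""
    return 1.0 / (1.0 + math.exp(sharpe_excess_return))


def accept(delta_e, temperature, rng):
    """Accept improvements always; accept worse states with exp(-dE / T)."""
    if delta_e <= 0:
        return True
    return rng.random() < math.exp(-delta_e / temperature)


def anneal(score, weights, t0, steps, step_size=0.05, seed=0):
    """Minimise energy(score(w)) with the cooling schedule T(t) = T0 / (1 + t)."""
    rng = random.Random(seed)
    current = list(weights)
    current_e = energy(score(current))
    best, best_e = list(current), current_e
    for t in range(steps):
        temperature = t0 / (1.0 + t)
        candidate = list(current)
        i = rng.randrange(len(candidate))  # simplified move: nudge one weight
        candidate[i] += rng.choice((-step_size, step_size))
        candidate_e = energy(score(candidate))
        if accept(candidate_e - current_e, temperature, rng):
            current, current_e = candidate, candidate_e
            if current_e < best_e:
                best, best_e = list(current), current_e
    return best, best_e


def toy_score(w):
    """Hypothetical score, largest when every weight equals 0.5."""
    return -sum((x - 0.5) ** 2 for x in w)


best_weights, best_energy = anneal(toy_score, [0.0, 0.0], t0=1.0, steps=200)
```

Because the temperature shrinks as 1/(1 + t), moves that increase the energy are accepted fairly often early on and rarely later, which is what lets the search leave a local optimum that hill climbing would be stuck in.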

5  Implementation  and  Results 

We focus on a universe designed to capture the attractive long-term return
potential associated with small stocks. We purchase a stock when it ranks in
the top 10% of any economic sector. A stock is sold when it ranks below the top
30% of its economic sector. The round-trip trading cost is 3.6%. The total number
of stocks in our portfolio is between 50 and 60. We also choose two widely used
value measures and two widely used growth measures to form an integrated
return/risk portfolio. The value measures are earnings to price ratio and book
value to price ratio, whereas the growth measures are short-term (four quarters)
earnings change to price ratio and long-term (three-year) earnings change to
price ratio.

We collect two data sets: one from 11/86 to 10/91 and the other from 6/93
to 5/96. The corresponding rates of return of the Russell 2000 index are
collected as benchmarks. The results in Tables 1 and 2 show that the simulated
annealing algorithm (SA) outperforms both the gradient maximization (GM) and the
market index in all time periods. Also, the longer the time period, the lower
the Sharpe ratio, which suggests that the risk grows with the length of the time
period. Although simulated annealing has superior rates of return, it has higher
turnover ratios.

Table 2. Portfolio construction using data during 6/93 - 5/96

Data Set         Year(s)  Portfolio  Benchmark  Excess  Sharpe  Turnover
(6/93 - 5/96)             Return     Return     Return  Ratio   Ratio
GM               1        30.51%     4.41%      32.10%  1.06    8.67%
GM               2        19.31%     11.97%     7.34%   0.26    7.74%
GM               3        20.62%     15.81%     4.81%   0.23    6.88%
SA               1        36.11%     4.41%      31.70%  1.06    7.96%
SA               2        20.92%     11.97%     8.95%   0.35    8.41%
SA               3        24.03%     15.81%     8.22%   0.32    9.10%
This paper presents a genetic model and a software tool (Rule-Evolver) for the
classification of records in Databases (DB). The model is based on the
evolution of association rules of the IF-THEN type, which provide a high level
of accuracy and coverage. The modeling of the Genetic Algorithm consists of
the definition of the chromosome representation, the evaluation function, and the
genetic operators. The Rule-Evolver is a tool that provides an environment for
the evaluation of the genetic model and implements the interface with DBs. The
case studies evaluate the performance of the model in several benchmark DBs.
The results obtained are compared with those of other models, such as
Artificial Neural Nets, Neuro-Fuzzy Systems and Statistical Models.


1.  Evolutionary  Data  Mining  Systems 


Genetic  Algorithms  (GAs)  have  been  successfully  used  in  optimization  problems  [1] 
and  some  data  mining  models  can  be  found  in  the  literature  [2].  In  the  context  of  GAs, 
classification  consists  of  the  evolution  of  association  rules. 

The  quality  of  the  rules  evolved  is  measured  through  their  Accuracy  and  Coverage. 
The  accuracy  of  an  association  rule,  IF  C  THEN  P,  measures  the  rule’s  degree  of 
confidence  (Equation  1). 


$$\mathrm{Accuracy} = \frac{|C \cap P|}{|C \cap P| + |C \cap \bar{P}|} \qquad (1)$$


A rule’s coverage may be interpreted as how comprehensively the rule includes the
records that have the target value P. Equation 2 gives the definition of a rule’s
coverage.


$$\mathrm{Coverage} = \frac{|C \cap P|}{|C \cap P| + |\bar{C} \cap P|} \qquad (2)$$
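Both measures reduce to counting records. The sketch below reads accuracy as the share of records matching the condition C that also have the target P, and coverage as the share of records with the target P that match C. The function name `accuracy_and_coverage`, the predicate arguments and the sample records are hypothetical and serve only as illustration.

```python
def accuracy_and_coverage(records, condition, is_target):
    """Count |C and P|, |C and not P|, |not C and P| and form both ratios."""
    both = sum(1 for r in records if condition(r) and is_target(r))
    cond_only = sum(1 for r in records if condition(r) and not is_target(r))
    target_only = sum(1 for r in records if not condition(r) and is_target(r))
    accuracy = both / (both + cond_only) if both + cond_only else 0.0
    coverage = both / (both + target_only) if both + target_only else 0.0
    return accuracy, coverage


# Hypothetical records: rule "IF x <= 3 THEN cls = P".
records = [
    {"x": 1, "cls": "P"},
    {"x": 2, "cls": "P"},
    {"x": 3, "cls": "Q"},
    {"x": 9, "cls": "P"},
]
acc, cov = accuracy_and_coverage(
    records,
    condition=lambda r: r["x"] <= 3,
    is_target=lambda r: r["cls"] == "P",
)
```

In this toy data the rule fires on three records, two of which are in class P, and class P has three records, two of which are covered by the rule.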
2.  Genetic  Algorithm  Modeling  for  Data  Mining

The  genetic  algorithm  consists  of  4  main  components:  Chromosome  Representation, 
Evaluation  Function,  Genetic  Operators  and  Initialization  of  the  Population. 

In  chromosome  representation,  categorical  or  discrete  attributes  represent  a  finite 
value-set,  or  mapped  values,  within  a  set  of  integers.  Quantitative  or  continuous 
attributes  represent  value  ranges  in  the  attribute  domain. 

Thus,  a  chromosome  must  represent  an  association  rule  by  means  of  the  value 
range  of  the  predictive,  quantitative  and  categorical  attributes  (Fig.  1). 


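[Figure: a row of N genes, one per attribute (Attribute 1, Attribute 2, ..., Attribute N);
each gene holds a pair (min, max), i.e., the lower limit (minimum value) and the upper
limit (maximum value) of the attribute's range.]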


Fig.  1.  Representation  of  the  chromosome  for  the  classification  task  by  a  vector  of  2N  values. 

In Fig. 1, each gene has two real numbers: the minimum and the maximum values.
This representation makes it possible to formulate rules such as:

IF (Attr. 1 ∈ [min 1, max 1]) and (Attr. 2 ∈ [min 2, max 2]) and ... and
(Attr. N ∈ [min N, max N]) THEN Target Attribute = P

where min X and max X indicate the minimum and maximum values of each
predictive attribute that has been defined as quantitative (for categorical attributes,
only min X is used), and P is the value of the Target Attribute, which has been
identified as an objective in the initial phase of the process.
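As an illustration of this encoding, the sketch below decodes a flat vector of 2N numbers into rule text and checks whether a record satisfies the rule. It covers quantitative attributes only (the categorical case, where only the minimum is used, is left out), and the names `decode` and `matches` are hypothetical.

```python
def decode(chromosome, names, target):
    """Turn a flat list of 2N numbers into readable IF-THEN text."""
    parts = []
    for i, name in enumerate(names):
        low, high = chromosome[2 * i], chromosome[2 * i + 1]
        parts.append(f"({name} in [{low}, {high}])")
    return "IF " + " and ".join(parts) + f" THEN Target Attribute = {target}"


def matches(chromosome, values):
    """True when every attribute value lies inside its gene's [min, max]."""
    return all(
        chromosome[2 * i] <= v <= chromosome[2 * i + 1]
        for i, v in enumerate(values)
    )


rule = [1.0, 2.0, 0.5, 1.5]  # two genes: [1.0, 2.0] and [0.5, 1.5]
text = decode(rule, ["Attr. 1", "Attr. 2"], "P")
```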

The advantage of this representation lies in its high level of comprehensibility and
in the fact that the value ranges of the attributes are evolved as real numbers.

Several  evaluation  functions  were  implemented  and  tested  with  three  types  of 
rewards:  accuracy,  coverage,  and/or  both  [3],  according  to  Equation  3. 

if ((accuracy(i) ≠ 0) and/or (coverage(i) ≠ 0)) then
    fitness(i) = f(i) * accuracy(i) * coverage(i)                          (3)

Among the ten functions that were implemented, the Cbayesian [3] function is
worthy of note. The Cbayesian function is inspired by the Bayesian classifiers [4] and
represents the product of the probabilities that the values of a rule’s attributes pertain
to an interval, given that the class of the current record is the one that has been
specified as the objective. Equation 4 presents this function, where Ai is an attribute,
ai is a value interval, C is the target attribute, and c is the value of the specified class.

P(A1 = a1 | C = c) * P(A2 = a2 | C = c) * ... * P(Ak = ak | C = c)          (4)
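A product of this kind can be estimated by simple counting. In the sketch below each conditional probability is approximated by the relative frequency, among the records of the target class, of values falling inside the interval; this estimator is an assumption, since the text does not say how the probabilities are obtained, and the name `cbayesian` is used only for illustration.

```python
def cbayesian(records, labels, intervals, target_class):
    """Product over attributes of P(A_j in a_j | C = c), estimated by counting."""
    in_class = [r for r, label in zip(records, labels) if label == target_class]
    if not in_class:
        return 0.0
    product = 1.0
    for j, (low, high) in enumerate(intervals):
        hits = sum(1 for r in in_class if low <= r[j] <= high)
        product *= hits / len(in_class)  # relative frequency within the class
    return product


# Hypothetical data: three records of class "c" and one of class "d".
records = [[1.0, 5.0], [2.0, 6.0], [3.0, 7.0], [9.0, 9.0]]
labels = ["c", "c", "c", "d"]
value = cbayesian(records, labels, [(1.0, 2.0), (5.0, 7.0)], "c")
```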

Several genetic operators have been tested in this project: one-point, two-point,
average, and uniform crossovers, simple mutation, and a new mutation operator called
“don’t care”. This operator eliminates a given attribute in the composition of the rule,
i.e., the entire domain of the attribute is considered valid for the rule.
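One way to picture the “don’t care” operator on the min/max encoding is to widen a gene to the full domain of its attribute, so that the attribute no longer restricts the rule. This is a sketch under assumptions: the random choice of gene, the `domains` argument and the function name `dont_care` are not taken from the text.

```python
import random


def dont_care(chromosome, domains, rng=random):
    """Widen one randomly chosen gene to its attribute's full domain."""
    i = rng.randrange(len(domains))
    mutated = list(chromosome)  # copy, so the input rule is left untouched
    mutated[2 * i], mutated[2 * i + 1] = domains[i]
    return mutated


rule = [1.0, 2.0, 3.0, 4.0]            # two genes: [1, 2] and [3, 4]
domains = [(0.0, 10.0), (0.0, 5.0)]    # assumed full domain of each attribute
relaxed = dont_care(rule, domains, random.Random(1))
```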

Methods  to  constrain  the  Genetic  Algorithm’s  search  space  or  to  introduce 
promising  solutions  were  implemented  to  initialize  the  population,  thus  increasing  its 
performance: 

1.  Random  seedless  initialization  methods:  a)  including  the  average  value  of  this 
attribute;  b)  including  the  median  value  of  this  attribute; 

2.  Random  initialization  methods  with  seeds:  a)  Seeds  from  previous  evolutions;  b) 
Seeds  from  random  Database  records. 

The initialization methods of group 1 consist of establishing the limits so as to
include the average value or the median value of those attributes, while the second
group makes use of genetic material which has already been evolved in previous
experiments and of information from the database itself.


3.  Rule-Evolver 

The  Rule-Evolver  is  a  data  mining  environment  which  incorporates  a  dedicated 
genetic  algorithm  model  to  evolve  classification  rules. 

The  Rule-Evolver  is  capable  of  extracting  all  the  association  rules  which 
differentiate  a  specific  record  cluster  from  the  other  records  in  a  database.  It 
comprises  4  dedicated  modules: 

•  Selection of database attributes → Allows the user to choose attributes of interest
based on each attribute’s average, variance, and variation coefficient;

•  Interpretation → Presents the best rules found in the IF (A1 and A2 and A3 and
... An) THEN P format;

•  Graphic Previewing → Plots the accuracy, coverage, and fitness value
(evaluation function) graphs of the individuals in the course of genetic evolution;

•  Parameterization of the environment → Allows the user to specify rates,
parameters, evaluation functions, operators, and evolutionary techniques.


4.  Case  Studies 

The benchmark databases used in this study were obtained from the
“ftp://ftp.ics.uci.edu/pub/machine-learning-databases/” repository. Two case studies
are summarized in this article: the Iris Plants Database and the Tic-Tac-Toe Endgame
Database. These databases were divided into 2 sets: a training set and a test set.

The  “Iris  Plants  Database”  comprises  150  records  divided  into  3  classes  of  50 
records  each  (Iris  Setosa,  Versicolour,  and  Virginica),  with  4  attributes:  the  plant’s 
petal  and  sepal  width  and  length. 

In Table 1, the classification results obtained by the Rule-Evolver are compared
with those obtained by means of other techniques: the NEFCLASS Neuro-Fuzzy
System; the Hierarchical Neuro-Fuzzy System (NFHQ); the Bayesian Neural Net with
the Gaussian approximation method; and a Bayesian Neural Net with the Markov
Chain Monte Carlo method (MCMC).


Table  1.  Iris  and  Tic-Tac-Toe  database  as  benchmark 


IRIS DATABASE                                 TIC-TAC-TOE DATABASE
(Error in the Training and Test Set - %)      (Error in the Test Set - %)

MODEL               TRAINING   TEST           MODEL               TEST
NEFCLASS [5]        2,67       4              NewID [6]           84
NFHQ [7]            2,67       2,67           CN2 [6]             98,1
RNB (BNN) [8]       0          2,67           MBRTalk [6]         88,4
RNB (MCMC) [8]      0          2,67           IB3-CI [6]          99,1
RULE-EVOLVER (GA)   0          13,34          RULE-EVOLVER (GA)   97,6


The results indicate that, though the Rule-Evolver obtained good results in the
training set, it presented a low performance level in the test set. Of the 10 records
reported as errors (13,34%), 3 records were wrongly classified between the Iris
Versicolour and Virginica classes, and 7 records were not classified, i.e., they were
not covered by any of the rules obtained during training. This problem is typical of
models that use training and test sets: the model does not succeed in finding rules
for attribute values that are not in the training set.

The second database tested was the “Tic-Tac-Toe Endgame Database”, which
encodes the complete set of possible final configurations of the game board under the
assumption that “x” has made the first move. It is composed of 958 records, of which
626 are “x wins”, with 9 attributes, each corresponding to a position on the game
board.

Table  2  presents  the  decoding  of  the  rules  found  by  the  Rule-Evolver  which  lead 
“x  to  win”,  where  symbol  #  (don’t  care)  means  that  “x  wins”  regardless  of  the  values 
contained  in  that  position  of  the  board.  Symbol  “b”  represents  a  blank  field,  and 
symbol  “o”  represents  the  other  player. 

Table  2.  Decoding  of  the  rules  found  by  the  Rule-Evolver  for  Tic-Tac-Toe 


[Table body: eight 3x3 board patterns, one per rule, over the symbols x and #; the
cell layout is not recoverable here.]

Rule 1: 216 Records    Rule 2: 153 Records    Rule 3: 41 Records    Rule 4: 39 Records
Rule 5: 46 Records     Rule 6: 43 Records     Rule 7: 39 Records    Rule 8: 44 Records


This example shows that 621 of the 626 records (99,2%) were covered by the
rules evolved by the Rule-Evolver, and 5 records were not covered on account of
Rule 3, which specialized in position 5 on the board (xo).

The results indicate that 4 records have been classified as being characteristic of
“x’s” victory, though they are not, and 3 records that lead to “x wins” have not been
classified. The model therefore fits 280 of the 287 records of the test set, thus
presenting a success rate of 97,6%.


5.  Conclusions 

Record  classification  through  the  evolution  of  associative  rules  with  the  use  of 
Genetic  Algorithms  has  proved  to  be  a  promising  procedure  for  characterizing  the 
record  clusters  in  a  database.  When  compared  with  other  methods  (Artificial  Neural 
Nets,  Statistical  Models),  the  advantage  of  rule  discovery  by  means  of  GAs  is  that  the 
rules  evolved  are  self-explanatory. 

The  current  model  is  incapable  of  evolving  correctly  the  interval  of  a  rule’s 
attribute  values  for  values  which  are  not  exemplified  in  the  training  set.  However,  the 
genetic  model  is  capable  of  generating  highly  accurate  rules  with  a  high  level  of 
coverage  without  any  conflicts  for  the  intervals  present  in  training. 

To  evaluate  one  generation  of  chromosome  rules,  the  Rule-Evolver  makes  a 
single  pass  over  the  data,  which  may  be  all  available  in  the  memory,  thus  reducing 
processing  time.  The  Rule-Evolver  automatically  tries  to  load  the  whole  database 
into  memory;  if  not  possible,  it  will  access  the  data  through  the  DBMS. 

For the databases tested (hundreds of records with fewer than ten attributes), the
Rule-Evolver’s run time was satisfactory: about 6 minutes on average on a Pentium II
350 MHz for the evaluation of 80 generations of populations of 100 individuals each.

The scalability of the genetic model in applications with large databases (say, with
millions of records) has not been assessed yet.
Abstract. The paper is concerned with decision making with predictive
models acquired from data, called probabilistic decision tables.
The methodology of probabilistic decision tables presented in this article
is derived from the theory of rough sets. In this methodology, the
probabilistic extension of the original rough set theory, called the variable
precision model of rough sets, is used. The theory of rough sets is applied
to identify dependencies of interest occurring in data. The identified
dependencies are represented in the form of a decision table which is
subsequently analyzed and optimized using rough sets-based methods. The
original model of rough sets is restricted to the analysis of functional, or
partial functional, dependencies. The variable precision model of rough
sets can also be used to identify probabilistic dependencies, allowing for
the construction of probabilistic predictive models. The main focus of the
paper is on the decision making aspect of the presented approach, in
particular on setting the parameters of the model and on decision strategies
to maximize the expected gain from the decisions.


1  Introduction 

Standard decision tables are tabular models of the functional dependencies
between input conditions and decisions or actions taken in response to the
occurrence of some combinations of conditions. They have been used in software
engineering, circuit design and other application areas for years [6]. The
dependency is encoded by the table designer in the form of a set of disjoint
decision rules covering all possible input situations. However, in many problems
related to decision making with uncertainty, machine learning, pattern recognition
and data mining, the condition-decision dependency is typically unknown and almost
always non-deterministic. Often, it is hidden in empirical data. A number of
analytical methodologies have been developed in recent years to approximate this
kind of dependency for the purpose of prediction or better understanding
of the nature of the relationship, for example, by using decision trees, neural
networks or rough sets [3, 5, 9-12].

In this paper, we will focus on using decision tables extracted from data for
that purpose. The research into the acquisition of decision tables from data was
initiated by Pawlak in the context of rough sets theory [1, 2]. His original works
were concerned with the acquisition of deterministic, or partially deterministic,
tables. We demonstrate how an extended approach, called the variable precision
rough sets model (VPRS), can be applied to the acquisition of non-deterministic
decision tables with a probabilistic characterization of their decision accuracy
[8]. In what follows, a review of the rough sets methods for the above mentioned
data-based modeling problem is presented and illustrated with simple examples. A
comprehensive discussion of the optimal decision making strategies and parameter
setting for the model is also included. Generally, the objective is not to
construct a predictive system which would guarantee always correct predictions
(which is typically impossible), but to have a system which would support
decisions with a sufficient success rate in the longer run, or sufficient expected
gain or profit from the decision making.

The  paper  is  organized  as  follows.  We  first  discuss  the  basics  of  the  formal 
model  of  decision  tables  acquired  from  data.  Then,  the  main  definitions  of  the 
variable  precision  model  of  rough  sets  are  introduced.  In  the  next  sections,  they 
are  used  to  define  extended  notions  of  the  dependency  between  attributes,  of 
the  extended  reduct  and  core  attributes.  A  separate  section  is  devoted  to  the 
discussion  of  the  optimal  decision  making  with  the  probabilistic  decision  tables. 

2  Decision  Tables  Acquired  from  Data 

Generally,  the  decision  table  is  defined  here  as  a  tabular  representation  of  a 
relation  discovered  in  data.  The  relation  is  identified  through  a  classification 
process  in  which  data  objects  having  the  same  values  of  selected  attributes, 
or  having  the  same  values  of  properly  selected  functions  of  the  attributes  (for 
example,  using  some  attribute  value  discretization  technique),  are  considered  to 
be identical. It should be noted, however, that this kind of decision table does
not necessarily represent a functional relationship, as is the case with "classical"
decision tables known in software engineering and other areas. More precisely,
the data-extracted decision table is defined as follows:

Let $U$ be the universe of objects $e \in U$ and let $a \in A$ be the attributes
of the objects, that is, functions $a : e \to a(e)$ assigning some features
(attribute values) to objects. We assume that every attribute maps into a finite
set of values, $V_a = \mathrm{range}(a)$. The attributes are divided into two
categories: condition attributes $C = \{a_1, a_2, \ldots, a_m\}$ and decision
attributes $D$. Typically, the condition attributes represent measurable
properties of objects whereas decision attributes are the "predictive"
attributes (variables) whose values are normally predicted based on known values
of condition attributes. We will assume here, without loss of generality, that
there is only one binary-valued decision attribute $d \in D$ and that one value
$v^i$ ($i = 0$ or $1$) of this attribute has been selected as a prediction or
modeling "target". With all these assumptions, the decision table can be
expressed as a quadruple $\langle U, C, d, v^i \rangle$. Each of the two values
$v^0$, $v^1$ of the decision attribute $d$ corresponds to a set of objects
matching that particular value. We will denote these sets as $X^0$ and $X^1$,
respectively. Clearly, $X^0 \cap X^1 = \emptyset$ and $X^0 \cup X^1 = U$. Our
objective in the construction and analysis of the decision
tables is to develop a simple predictive model for the target set which would
enable us to predict, with an acceptable confidence, whether an object matching
a combination of attribute values occurring in the decision table belongs to the
target set, or to its complement.

Table 1. Classification by condition attributes only

3  VPRS  Model  of  Rough  Sets 

In  data  mining  and  predictive  modeling  applications  the  variable  precision  model 
of  rough  sets  (VPRS)  was  used  for  analysis  of  decision  tables  extracted  from  data. 
The  VPRS  model  extends  the  capabilities  of  the  original  model  of  rough  sets 
to  handle  probabilistic  information.  The  main  aspects  of  the  VPRS  model  are 
presented  below. 

Let $R$ be an equivalence relation (called the indiscernibility relation) and let
$R^*$ be the set of equivalence classes of $R$. Typically, the relation $R$
represents the partitioning of the universe $U$ in terms of the values of
condition attributes, as defined in Section 2. Also, let $E \in R^*$ be an
equivalence class (elementary set) of the relation $R$. With each class $E$ we
can associate the estimate of the conditional probability $P(X|E)$ by the
formula $P(X|E) = \mathrm{card}(X \cap E)/\mathrm{card}(E)$, assuming that the
sets $X$ and $E$ are finite. This situation is illustrated in Table 1, which
represents the classification of raw data in terms of condition attributes
S, H, E, C, with each class $E_i$ being assigned probabilities $P(T = 1|E_i)$
and $P(T = 0|E_i)$.
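
The estimate of $P(X|E)$ described above can be sketched in a few lines of Python. This is only an illustration: the function name, the record layout (one dictionary per object) and the sample values are assumptions, not taken from the paper or from Table 1.

```python
from collections import defaultdict


def conditional_probabilities(records, condition_attrs, decision_attr, target_value):
    """Group records into elementary sets E by their condition-attribute values
    and estimate P(X|E) = card(X ∩ E) / card(E) for each E."""
    counts = defaultdict(lambda: [0, 0])  # key -> [card(X ∩ E), card(E)]
    for record in records:
        key = tuple(record[a] for a in condition_attrs)
        counts[key][1] += 1
        if record[decision_attr] == target_value:
            counts[key][0] += 1
    return {key: hits / size for key, (hits, size) in counts.items()}


# Hypothetical raw data over two condition attributes and the decision T.
records = [
    {"S": 0, "H": 1, "T": 1},
    {"S": 0, "H": 1, "T": 1},
    {"S": 0, "H": 1, "T": 0},
    {"S": 1, "H": 0, "T": 0},
]
probabilities = conditional_probabilities(records, ["S", "H"], "T", 1)
```

Each key of the returned dictionary identifies one elementary set, so the result has the same shape as a row-per-class table of probabilities.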

Let $0 \le l < u \le 1$ be real-valued approximation precision control
parameters, called the lower and upper limits, respectively. For any subset
$X \subseteq U$ we define the $u$-positive region of $X$, $POS_u(X)$, as the
union of those elementary sets whose conditional probability $P(X|E)$ is not
lower than the upper limit $u$, that is

$$POS_u(X) = \bigcup \{E \in R^* : P(X|E) \ge u\}$$

The  u-positive  region  of  X  represents  an  area  in  the  universe  which  contains 
objects  with  relatively  high  probability  of  belonging  to  the  set  X. 

Table 2. The probabilistic decision table with $u = 0.8$ and $l = 0.2$

The $(l,u)$-boundary region $BNR_{l,u}(X)$ of the set $X$ with respect to the
lower and upper limits $l$ and $u$ is the union of those elementary sets $E$ for
which the conditional probability $P(X|E)$ is higher than the lower limit $l$
and lower than the upper limit $u$. Formally,

$$BNR_{l,u}(X) = \bigcup \{E \in R^* : l < P(X|E) < u\}$$

The boundary area represents objects which cannot be classified with
sufficiently high confidence (represented by $u$) into the set $X$ and which also
cannot be excluded from $X$ with sufficiently high confidence (represented by
$1 - l$).

The $l$-negative region $NEG_l(X)$ of the subset $X$ is a collection of objects
which can be excluded from $X$ with confidence not lower than $1 - l$, that is,

$$NEG_l(X) = \bigcup \{E \in R^* : P(U - X|E) \ge 1 - l\}$$

The  l-negative  region  represents  objects  of  the  universe  for  which  it  is  known 
that  it  is  relatively  unlikely  that  they  would  belong  to  X. 
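
The three region definitions translate directly into a threshold test on $P(X|E)$. The following is a minimal sketch; the function name and the region labels are chosen here for illustration. It uses the fact that $P(U - X|E) \ge 1 - l$ is equivalent to $P(X|E) \le l$.

```python
def approximation_region(p, lower, upper):
    """Return the region ('POS', 'NEG' or 'BND') of an elementary set E
    whose conditional probability P(X|E) equals p."""
    if not 0 <= lower < upper <= 1:
        raise ValueError("expected 0 <= lower < upper <= 1")
    if p >= upper:
        return "POS"  # P(X|E) is not lower than the upper limit u
    if p <= lower:
        return "NEG"  # same as P(U - X|E) = 1 - p >= 1 - l
    return "BND"      # l < P(X|E) < u


# Illustrative probabilities classified with u = 0.8 and l = 0.2.
regions = [approximation_region(p, 0.2, 0.8) for p in (0.82, 0.12, 0.5)]
```

Comparing `p` with `lower` directly, rather than comparing `1 - p` with `1 - lower`, avoids spurious floating-point differences at the threshold.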

In Table 2, each of the classes $E_i$ of Table 1 is assigned to one
of the rough approximation regions, according to the above definitions, with
$u = 0.8$ and $l = 0.2$. The decision table in which each combination of
condition attribute values is assigned its approximation region with respect to
the target value of the decision attribute (in this example, $T = 1$) is called
a probabilistic decision table [8]. The probabilistic decision table can be used
to predict the target value of the decision attribute, or its complement, with
probabilities not lower than $u$ and $1 - l$, respectively.


4  (l,u)-Dependency in Decision Tables

The analysis of decision tables extracted from data involves inter-attribute
dependency analysis, identification and elimination of redundant condition
attributes, and attribute significance analysis [2]. The original rough sets
model-based analysis involves detection of functional, or partial functional,
dependencies and subsequent dependency-preserving reduction of condition
attributes. In this paper, we extend this idea by using $(l,u)$-probabilistic
dependency as a reference rather than functional or partial functional
dependency. To define $(l,u)$-probabilistic dependency we will assume that the
relation $R$ corresponds to the partitioning of the universe $U$ in terms of
values of condition attributes $C$ in the decision table
$\langle U, C, d, v^i \rangle$ ($i = 0$ or $1$). In other words, we assume that
objects having identical values of the attributes are considered to be
equivalent.

Table 3. The probabilistic decision table with $u = 0.83$ and $l = 0.11$

The $(l,u)$-probabilistic dependency $\gamma_{l,u}(C, d, i)$ between condition
attributes $C$ and the decision attribute $d$ in the decision table
$\langle U, C, d, v^i \rangle$ is defined as the total relative size of the
$(l,u)$-approximation regions of the subset $X^i \subseteq U$ corresponding to
the target value of the decision attribute. In other words, we have

$$\gamma_{l,u}(C, d, i) = \frac{\mathrm{card}(POS_u(X^i)) + \mathrm{card}(NEG_l(X^i))}{\mathrm{card}(U)}$$

The dependency degree can be interpreted as a measure of the probability
that a randomly occurring object will be represented by such a combination of
condition attribute values that the prediction of the corresponding value of the
decision attribute could be done with acceptable confidence, as represented
by the $(l,u)$ pair of parameters.

To illustrate the notion of $(l,u)$-dependency, let us consider the
classification given in Table 1 again. When $u = 0.80$ and $l = 0.15$ the
dependency equals 1.0. This means that every object $e$ from the universe $U$
can be classified either as a member of the target set with probability not less
than 0.8, or as a member of the complement of the target set with probability
not less than 0.85. The lower and upper limits define acceptable probability
bounds for predicting whether an object is, or is not, a member of the target
set.

If the $(l,u)$-dependency is less than one, it means that the information
contained in the table is not sufficient to make either a positive or a negative
prediction in some cases. For instance, if we take $u = 0.83$ and $l = 0.11$
then the probabilistic decision table will appear as shown in Table 3. As we
see, when objects are classified into boundary classes, neither positive nor
negative prediction with acceptable confidence is possible. This situation is
reflected in the $(0.11, 0.83)$-dependency being 0.7 (assuming even distribution
of atomic classes $E_1, \ldots, E_8$ in the universe $U$).
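
As a rough illustration of the dependency measure, the sketch below sums the sizes of the elementary sets that fall into the positive or negative region and divides by the size of the universe. The input format (a list of size/probability pairs) and the sample classes are assumptions made for this example; they are not the classes of Table 1.

```python
def dependency(classes, lower, upper):
    """Compute the (l,u)-dependency from (card(E), P(X|E)) pairs,
    one pair per elementary set E of the partition of U."""
    total = 0
    decided = 0
    for size, p in classes:
        total += size
        # E belongs to POS_u(X) if p >= u and to NEG_l(X) if p <= l.
        if p >= upper or p <= lower:
            decided += size
    return decided / total


# Four hypothetical classes of equal size; only the third is a boundary class.
classes = [(10, 0.90), (10, 0.05), (10, 0.50), (10, 0.85)]
gamma = dependency(classes, lower=0.11, upper=0.83)
```

With these made-up classes, three quarters of the universe can be predicted with the required confidence, so the dependency is 0.75.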

5  Optimization  of  Precision  Control  Parameters 

An interesting question, inspired by practical applications of the variable
precision rough set model, is how to set the values of the precision control
parameters $l$ and $u$ to achieve the desired quality of prediction. It is, in
fact, an optimization problem, strongly connected to the external knowledge of
possible gains and losses associated with correct or incorrect predictions,
respectively. It also depends on the quality of the information encoded in the
data used to create the probabilistic decision table. In general, setting lower
values of $l$ and higher values of $u$ results in increasing the size of the
boundary area at the expense of the positive and negative regions. In practical
terms, this means that we may not always be able to make decisions with the
confidence level we would like. If nothing is known about the potential gains or
losses associated with the decisions, the reasonable goal is to increase the
likelihood of positive correct prediction about the target value of the decision
attribute, i.e. above the random guess probability of success (by positive
correct prediction we mean correctly predicting that the selected value will
occur). Similarly, we are interested in increasing the probability of negative
correct prediction, i.e. predicting correctly that a particular target value
will not occur. We would like this probability to be above the random guess
probability of success as well. That is, given the distribution of the target
value of the decision attribute to be $(p, 1 - p)$, where $p$ is the probability
that an object has the target value of the decision attribute and $1 - p$ is the
probability that it does not, the reasonable settings of the parameters are
$0 \le l < p$ and $1 \ge u > p$. With the settings falling within these limits,
in the negative region the prediction that an object does not belong to the
target set would be made with confidence higher than random guess, i.e. with
probability not less than $1 - l > 1 - p$, and, in the positive region, the
prediction that an object belongs to the target set would be made with
probability not less than $u$.

Clearly, other factors can affect the selection of the precision control
parameters. In particular, an interesting question is how to set those
parameters in a game playing situation, where each decision making act carries a
cost (bet cost $b > 0$) and an incorrect decision results in a loss whereas a
correct decision results in a win. Because there are two possible outcomes of
the decision, and one can pick any of these outcomes, there are two kinds of
losses and two kinds of wins:

- positive win, when the positive outcome is bet (that is, that the target value
will occur) and that outcome really occurred; the win is denoted here as
$q^{++} > 0$ and the cost of this betting is denoted as $b^+$;

- positive loss, when the positive outcome is bet but that outcome did not occur;
the loss is denoted here as $q^{+-} < 0$;

- negative win, when the negative outcome is bet (that is, that the target value
will not occur) and that outcome really occurred; the win is denoted here as
$q^{--} > 0$ and the cost of this betting is denoted as $b^-$;

- negative loss, when the negative outcome is bet but that outcome did not
occur; the loss is denoted here as $q^{-+} < 0$.

In addition to the assumptions listed above we will assume that both positive
and negative wins are not smaller than the cost of betting, that is
$q^{--} \ge b^- > 0$ and $q^{++} \ge b^+ > 0$, and that the absolute values of
both negative and positive losses are not smaller than the bet, that is
$|q^{-+}| \ge b^-$ and $|q^{+-}| \ge b^+$.

Also,  with  each  approximation  region  we  will  associate  an  expected  gain 
function,  which  is  the  weighted  average  of  wins  and  losses  in  the  respective 
region.  Our  decision  making  strategy  assumes  that  in  the  positive  region  the 
positive  outcome  is  bet,  and  that  in  the  negative  region,  the  negative  outcome 
is  bet.  The  bet  in  the  boundary  region  will  depend  on  the  value  of  the  expected 
gain  function,  and  we  will  assume  that  the  bet  which  maximizes  the  gain  function 
is selected. The gain functions Q(approximation region) are defined as follows:

- $Q(POS) = p(+|POS) \cdot q^{++} + p(-|POS) \cdot q^{+-}$, where $p(+|POS)$ and
$p(-|POS)$ are conditional probabilities of positive and negative outcomes,
respectively, within the positive region;

- $Q(NEG) = p(+|NEG) \cdot q^{-+} + p(-|NEG) \cdot q^{--}$, where $p(+|NEG)$ and
$p(-|NEG)$ are conditional probabilities of positive and negative outcomes,
respectively, within the negative region;

- $Q(BND) = p(+|BND) \cdot q^{++} + p(-|BND) \cdot q^{+-}$ or
$Q(BND) = p(+|BND) \cdot q^{-+} + p(-|BND) \cdot q^{--}$, depending on the bet,
whichever value is higher with the positive or negative bet, where $p(+|BND)$
and $p(-|BND)$ are conditional probabilities of positive and negative outcomes,
respectively, within the boundary region.

Let us note that:

1. $Q(POS) \ge u \cdot q^{++} + (1 - u) \cdot q^{+-}$ and

2. $Q(NEG) \ge l \cdot q^{-+} + (1 - l) \cdot q^{--}$.

The uncertain decision is considered advantageous and justified if the
expected gain is not lower than the cost of the bet, i.e. if $Q(POS) \ge b^+$
and $Q(NEG) \ge b^-$, assuming that the positive outcome is bet in the positive
region and the negative outcome is bet in the negative region. By focusing on
these two regions we can determine from (1) and (2) the bounds for the
parameters $l$ and $u$ that maximize the size of the positive and negative
regions while guaranteeing that $Q(POS) \ge b^+$ and $Q(NEG) \ge b^-$. From the
conditions $u \cdot q^{++} + (1 - u) \cdot q^{+-} \ge b^+$ and
$l \cdot q^{-+} + (1 - l) \cdot q^{--} \ge b^-$ we get the following bounds for
the precision control parameters:

$$1 \ge u \ge \frac{b^+ - q^{+-}}{q^{++} - q^{+-}} \quad \text{and} \quad 0 \le l \le \frac{q^{--} - b^-}{q^{--} - q^{-+}}.$$

To maximize the sizes of both positive and negative areas the upper limit
should assume the minimal range value and the lower limit should assume the
maximal range value, that is:

$$u = \frac{b^+ - q^{+-}}{q^{++} - q^{+-}} \quad \text{and} \quad l = \frac{q^{--} - b^-}{q^{--} - q^{-+}}.$$
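
As a numerical illustration, the limiting values of the two parameters can be obtained by solving the conditions $u \cdot q^{++} + (1 - u) \cdot q^{+-} \ge b^+$ and $l \cdot q^{-+} + (1 - l) \cdot q^{--} \ge b^-$ for $u$ and $l$. The sketch below does this; the function and argument names, and the payoff values, are assumptions chosen for the example.

```python
def precision_limits(q_pp, q_pm, q_mm, q_mp, b_pos, b_neg):
    """Return (l, u): the largest l and smallest u for which the expected
    gain in the negative and positive regions still covers the bet.

    q_pp: positive win (> 0)    q_pm: positive loss (< 0)
    q_mm: negative win (> 0)    q_mp: negative loss (< 0)
    b_pos, b_neg: costs of the positive and negative bets (> 0)
    """
    # u*q_pp + (1-u)*q_pm >= b_pos  <=>  u >= (b_pos - q_pm) / (q_pp - q_pm)
    upper = (b_pos - q_pm) / (q_pp - q_pm)
    # l*q_mp + (1-l)*q_mm >= b_neg  <=>  l <= (q_mm - b_neg) / (q_mm - q_mp)
    lower = (q_mm - b_neg) / (q_mm - q_mp)
    return lower, upper


# Hypothetical symmetric game: win 2, lose 2, each bet costs 1.
l, u = precision_limits(q_pp=2, q_pm=-2, q_mm=2, q_mp=-2, b_pos=1, b_neg=1)
```

For this made-up game the limits come out as $l = 0.25$ and $u = 0.75$; substituting them back gives an expected gain of exactly 1 in both regions, which equals the bet.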

We should be aware, however, that these bounds only set the requirements on how
the rough approximation regions should be defined in order to obtain the desired
expected results of decision making processes. The actual data set may not
support these bounds, in the sense that the positive region, the negative
region, or both may be empty, resulting in the boundary area covering the whole
universe. In general, it can be demonstrated that in the boundary area,
regardless of whether the positive or negative bet is made, the expected gain is
always less than the respective bet, that is $Q(BND) < b^+$ if the positive bet
is taken, and $Q(BND) < b^-$ if the negative bet is taken. Consequently, if the
decision has to be made in the boundary area, one should take the one which
maximizes $Q(BND)$, but in the longer run the "player" is in the losing position
anyway in the boundary area. The expected gain $G$ from making decisions based
on the whole decision table, in accordance with the assumptions and decision
strategy described above, is given by:

$$G = p(POS) \cdot Q(POS) + p(NEG) \cdot Q(NEG) + p(BND) \cdot Q(BND)$$

where $p(POS)$, $p(NEG)$ and $p(BND)$ are the probabilities of the respective
approximation regions (these probabilities can be approximated based on the
frequency distribution of data records belonging to the respective regions).
Only if the overall expected gain $G$ is higher than the expected cost of
betting is the "player" winning in the longer run. This clearly sets the limit
on the applicability of probabilistic decision tables to support decision
making.
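
The overall expected gain can be evaluated with a short sketch that follows the decision strategy of this section: a positive bet in the positive region, a negative bet in the negative region, and the better of the two bets in the boundary region. The data layout and all numbers below are hypothetical.

```python
def bet_gain(p_plus, payoff_if_plus, payoff_if_minus):
    """Expected payoff of one bet when the positive outcome has probability p_plus."""
    return p_plus * payoff_if_plus + (1 - p_plus) * payoff_if_minus


def expected_gain(regions, q_pp, q_pm, q_mm, q_mp):
    """regions maps 'POS', 'NEG', 'BND' to (p(region), p(+|region))."""
    p_pos, plus_pos = regions["POS"]
    p_neg, plus_neg = regions["NEG"]
    p_bnd, plus_bnd = regions["BND"]
    q_pos = bet_gain(plus_pos, q_pp, q_pm)  # positive bet: win q_pp, lose q_pm
    q_neg = bet_gain(plus_neg, q_mp, q_mm)  # negative bet: lose q_mp, win q_mm
    q_bnd = max(bet_gain(plus_bnd, q_pp, q_pm), bet_gain(plus_bnd, q_mp, q_mm))
    return p_pos * q_pos + p_neg * q_neg + p_bnd * q_bnd


# Hypothetical region probabilities and payoffs.
regions = {"POS": (0.5, 0.75), "NEG": (0.25, 0.25), "BND": (0.25, 0.5)}
G = expected_gain(regions, q_pp=2, q_pm=-2, q_mm=2, q_mp=-2)
```

In this example the boundary region contributes nothing to the gain, and $G = 0.75$ would still have to be compared with the expected cost of betting.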

6  Summary 

We  briefly  review  in  this  section  the  main  points  of  the  described  approach  to 
predictive  modeling  and  decision  making. 

The main distinguishing feature of this approach is that it is primarily
concerned with the acquisition of decision tables from data and with their
analysis and simplification using notions of attribute dependency, reduct, core
and attribute significance. The decision tables represent "discovered"
inter-data dependencies, which implies that, in general, a number of decision
tables can be extracted from a given data collection. An important issue in the
whole process of decision table acquisition from data is the choice of the
mapping from the original attributes, in which raw data are expressed, to the
finite-valued attributes used in the decision table. This is an application
domain-specific task, often requiring deep knowledge of the domain. One popular
technique is discretization of continuous attributes. However, the
discretization of continuous attributes [7] is a comprehensive research topic in
itself, whose discussion goes beyond the scope of this article. Decision making
with probabilistic decision tables is typically uncertain. The decision strategy
involves making a positive prediction in the positive region, a negative
prediction in the negative region, and a positive or negative prediction in the
boundary region, depending on the value of the gain function. The techniques
described in this article are aimed at constructing probabilistic decision
tables which would support uncertain decision making leading to long range gains
rather than to correct decisions in each case. They seem to be applicable to
practical problems involving making guesses based on past data, such as stock
market price movement prediction or market research.


Abstract. Inspired by Version Space learning, the Iterated Version
Space Algorithm (IVSA) has been designed and implemented to learn
disjunctive concepts. IVSA dynamically partitions its search space of
potential hypotheses of the target concept into contour-shaped regions
until all training instances are maximally correctly classified.


1  Introduction 


Since the mid-1950s, many AI researchers have developed learning systems that
automatically improve their performance. Vere's Multiple Convergence Algorithm
[12] and Mitchell's Candidate Elimination Algorithm [6] introduced a novel
approach to concept learning known as the Version Space Algorithm (VSA). Unlike
other learning algorithms, which used either generalization or specialization
alone, VSA employed both. VSA has advantages: no backtracking for any seen
training instances and a unique concept description that is consistent with all
seen instances. VSA also has weaknesses: training instances must be noise free
and the target concept must be simple. These problems have prevented VSA from
practical use outside the laboratory.

During the last few years, many improved algorithms based on VSA have
been designed and/or implemented. In Section 2, we first introduce VSA and
then highlight two improved methods to compare with the IVSA approach. The
discussion in Section 2 focuses on learning a disjunctive concept from six
training instances, which are noise free so that problems caused by learning
disjunctive concepts can be isolated from problems caused by noisy training
instances. Section 3 presents the overall approach of IVSA. Preliminary
experimental results on several ML databases [10] and English pronunciation
databases [13] are presented in Section 4. Discussions on each specific test and
sample rules are also given in Section 4. In Section 5, we summarize current
research on IVSA and give suggestions for future research.

2  Version Space Related Research

2.1  The  Version  Space  Algorithm 

A  version  space  is  a  representation  that  contains  two  sets  of  hypotheses,  the 
general  hypotheses  (G  set)  and  the  specific  hypotheses  (S  set).  Both  G  and  S 
must  be  consistent  with  all  examined  instances.  Positive  instances  make  the  S 
set  more  general  to  include  all  positive  instances  seen,  while  negative  instances 
make  the  G  set  more  specific  to  exclude  all  negative  instances  seen.  If  the  training 
instances  are  consistent  and  complete,  G  and  S  sets  eventually  merge  into  one 
hypothesis  set.  This  unique  hypothesis  is  the  learned  concept  description. 


Fig. 1. The Version Space after the Fourth Instance (P3)


The following six noise-free training instances have been selected to illustrate
problems with VSA. The value '1' or '0' in each instance indicates that the
patient had a positive or negative allergic reaction, respectively.

P1 = (lunch, expensive, rice, coffee, Sam's, 1)

N1 = (supper, expensive, bread, coffee, Tim's, 0)

P2 = (supper, cheap, rice, tea, Tim's, 1)

P3 = (breakfast, cheap, bread, tea, Tim's, 1)

P4 = (supper, expensive, rice, tea, Bob's, 1)

P5 = (supper, cheap, rice, coffee, Sam's, 1)

Figure 1 shows that as soon as instance P3 is processed, the new specific
hypothesis S3 must be discarded due to over-generalization. When either the G or
the S set becomes empty, the version space collapses, and thus "No legal concept
description is consistent with this new instance as well as all previous training
instances" [6], although a concept description of "tea or rice" can easily be
derived by hand. To improve VSA learning, Hirsh has designed a new algorithm,
the Incremental Version Space Merging (IVSM) [3].
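
The collapse on the allergy instances can be reproduced with a simplified sketch that tracks only the specific (S) boundary: each positive instance generalizes the hypothesis by replacing mismatching attribute values with "?", and the hypothesis is rejected as soon as it covers a negative instance seen so far. This is an illustrative reduction of candidate elimination, not the full algorithm with a G set, and the helper names are invented here.

```python
WILDCARD = "?"


def covers(hypothesis, attributes):
    """True if every non-wildcard value of the hypothesis matches the instance."""
    return all(h == WILDCARD or h == a for h, a in zip(hypothesis, attributes))


def generalize(hypothesis, attributes):
    """Minimal generalization of the hypothesis that also covers the instance."""
    return tuple(h if h == a else WILDCARD for h, a in zip(hypothesis, attributes))


def specific_hypothesis(instances):
    """Return the S hypothesis after all instances, or None if it collapses."""
    specific = None
    negatives = []
    for attributes, label in instances:
        if label == 1:
            specific = attributes if specific is None else generalize(specific, attributes)
        else:
            negatives.append(attributes)
        if specific is not None and any(covers(specific, n) for n in negatives):
            return None  # over-generalized: a negative instance is covered
    return specific


P1 = (("lunch", "expensive", "rice", "coffee", "Sam's"), 1)
N1 = (("supper", "expensive", "bread", "coffee", "Tim's"), 0)
P2 = (("supper", "cheap", "rice", "tea", "Tim's"), 1)
P3 = (("breakfast", "cheap", "bread", "tea", "Tim's"), 1)

after_p2 = specific_hypothesis([P1, N1, P2])
after_p3 = specific_hypothesis([P1, N1, P2, P3])
```

After P2 the specific hypothesis keeps only "rice"; adding P3 forces every attribute to "?", which also covers N1, so the sketch reports a collapse.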


2.2  The  Incremental  Version  Space  Merging 

Instead of building one version space, IVSM constructs many version spaces
$VS_{1 \ldots n}$, where $n$ is the number of training instances. For each
$i \le n$, IVSM first constructs $VS_i$ using only one training instance, and
then computes the intersection of $VS_i$ and $VS_{(i-1)}$. That is, for each
pair of boundary hypotheses G1, S1 in $VS_i$ and G2, S2 in
$VS_{(i-1) \cap (i-2)}$, IVSM repeatedly specializes each pair of hypotheses in
G1 and G2, and generalizes each pair of hypotheses in S1 and S2, to form a new
version space $VS_{i \cap (i-1)}$. This merging process repeats until all the
instances have been learned.


Fig. 2. The IVSM Approach after Processing the Fourth Instance (P3)


The same six training instances are used to demonstrate IVSM learning.
In Figure 2, after IVSM has computed the intersection with the version space of
the fourth instance (P3), the resulting specific hypothesis [?, ?, ?, ?, ?, 1] is
overly generalized. According to the IVSM merging algorithm [3], the current
specific hypothesis must be pruned. IVSM, therefore, does not offer a solution
for this particular exercise.

2.3  The  Parallel  Based  Version  Space  Learning 

Another recent line of research into VSA is Parallel Based Version Space (PBVS)
learning [4]. Like the IVSM approach, PBVS also uses a version space merging
algorithm, except that PBVS divides the entire set of training instances into
two groups, constructs two version spaces simultaneously, one from each group,
and then merges these two version spaces as IVSM does. Figure 3 shows the PBVS
learning process using the same six instances. Again, when PBVS merges $VS_1$
into $VS_2$, the resulting boundary sets are empty. Therefore, PBVS fails to
learn this set of training instances for the same reason that causes IVSM
learning to fail.


Fig. 3. The PBVS Approach after Processing the Fourth Instance


3  The  Iterated  Version  Space  Learning 

The allergy example is simple and can be described with two hypotheses. But
when the numbers of training instances, attributes, and classes grow larger,
it becomes more and more difficult to detect which attribute value
would be a true feature that distinguishes instances of different classes.
However, VSA has already provided a natural way of separating different
features. That is, whenever VSA collapses, the search has encountered a new
feature. This is one of the new ideas behind IVSA.


3.1  Learning the Allergy Example with IVSA

Before showing the detailed algorithm and approach, let us apply the same six
allergy instances to IVSA. As Figure 1 shows, when the version space collapses
while processing P3, instead of failing, IVSA first collects G3 and S2 as
candidate hypotheses, and then constructs a new version space with P3 to learn a
different feature of the same concept. When all six training instances have been
processed, IVSA has collected three candidate hypotheses: [?, ?, rice, ?, ?, 1];
[?, ?, ?, ?, ?, 1]; and [?, ?, ?, tea, ?, 1]. These candidate hypotheses are then
evaluated using a score $R_i$ defined over $E^+$ and $E^-$, the sets of all
positive and negative training instances, respectively. $E_i^+ \subseteq E^+$ is
the set of positive instances covered by the $i$th candidate hypothesis, and
$E_i^- \subseteq E^-$ is the set of negative instances covered by the same
candidate hypothesis. For the allergy example, $R_2 = 0$. Therefore,
[?, ?, rice, ?, ?, 1] and [?, ?, ?, tea, ?, 1] are selected as the
concept description: ((A3 = rice) ∨ (A4 = tea)) → allergy.

Fig. 4. Using IVSA for the Allergy Example
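
The evaluation of candidate hypotheses can be illustrated with a small sketch. The exact form of the score $R_i$ is not legible in this text, so the code below assumes, purely for illustration, that it is the fraction of positive instances covered minus the fraction of negative instances covered; this assumed form is consistent with the value $R_2 = 0$ reported for the all-wildcard hypothesis. All function names are hypothetical.

```python
from fractions import Fraction

WILDCARD = "?"


def covers(hypothesis, attributes):
    """True if every non-wildcard value of the hypothesis matches the instance."""
    return all(h == WILDCARD or h == a for h, a in zip(hypothesis, attributes))


def score(hypothesis, positives, negatives):
    """ASSUMED score: |E_i+| / |E+| - |E_i-| / |E-| (not confirmed by the source)."""
    covered_pos = sum(covers(hypothesis, e) for e in positives)
    covered_neg = sum(covers(hypothesis, e) for e in negatives)
    return Fraction(covered_pos, len(positives)) - Fraction(covered_neg, len(negatives))


positives = [
    ("lunch", "expensive", "rice", "coffee", "Sam's"),   # P1
    ("supper", "cheap", "rice", "tea", "Tim's"),         # P2
    ("breakfast", "cheap", "bread", "tea", "Tim's"),     # P3
    ("supper", "expensive", "rice", "tea", "Bob's"),     # P4
    ("supper", "cheap", "rice", "coffee", "Sam's"),      # P5
]
negatives = [("supper", "expensive", "bread", "coffee", "Tim's")]  # N1

candidates = [
    ("?", "?", "rice", "?", "?"),
    ("?", "?", "?", "?", "?"),
    ("?", "?", "?", "tea", "?"),
]
scores = [score(h, positives, negatives) for h in candidates]
selected = [h for h, r in zip(candidates, scores) if r > 0]
```

Under this assumed score the all-wildcard hypothesis is rejected because it covers the negative instance N1, while the "rice" and "tea" hypotheses are kept, matching the concept description given above.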


3.2  Learning  from  Noisy  Training  Instances 

When training instances contain noise, the noise interferes with or even stops
the learning. With IVSA, noisy training instances are simply ignored. Here we use
the same allergy example as in Section 2.1 plus a noisy instance N2 = (supper,
cheap,  rice,  tea,  Tim’s,  0).  Figure  5  shows  this  learning  process.  In  the  first 
version  space,  IVSA  simply  ignores  N2  just  like  it  ignores  instances  representing 
different  features  such  as  P3  in  Figure  4  in  the  second  version  space.  Because 
N2  is  negative,  IVSA  amalgamates  the  second  version  space  with  7^3.  But  if 
the incorrect instance were classified as positive, IVSA would start with this
instance and later the hypothesis generated from this noisy instance would be
discarded. The learned concept description is not affected by N2 because
IVSA recognizes that N2 does not represent a feature of the concept.

Fig. 5. Learning Noisy Training Instances with IVSA


3.3  The  IVSA  Model 

Learning  a  concept  is  similar  to  assembling  a  multi-dimensional  jigsaw  puzzle 
from  a  large  selection  of  possible  pieces.  The  target  concept  can  be  viewed  as  the 
puzzle  and  an  ordered  list  of  disjunctive  hypotheses  can  be  viewed  as  groups  of 
puzzle  pieces.  One  method  of  solving  this  problem  is  to  repeatedly  generate  any 
possible  missing  pieces  and  add  them  to  the  puzzle  until  it  is  complete.  IVSA  is 
based  on  this  puzzle  assembling  method. 

As shown in Figure 6, IVSA contains the Example Analyser, Hypothesis
Generator, Hypothesis Assembler, and Hypothesis Remover. The Example Analyser
provides a statistical evaluation of each attribute value in the instance space to
determine the order of the input training instances. The Hypothesis Generator
produces a set of candidate hypotheses from the given set of training instances.
The Hypothesis Assembler repeatedly selects the most promising hypothesis from
a large number of candidate hypotheses according to the statistical evaluation
provided by the Example Analyser, and then tests this hypothesis in each position
in a list of accepted hypotheses. If adding a new hypothesis increases concept
coverage, it is placed in the position that causes the greatest increase; otherwise
this hypothesis is discarded. After the candidate hypotheses have been processed,
the list of accepted hypotheses is examined by the Hypothesis Remover to see if
any of the hypotheses can be removed without reducing accuracy. If the learning
accuracy is satisfactory, the accepted hypothesis set becomes the learned concept
description. Otherwise, the set of incorrectly translated instances is fed back
to the generator, and a new learning cycle starts.

Fig. 6. The IVSA Model
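
The position-testing step of the Hypothesis Assembler can be illustrated with a small sketch. It assumes that the accepted hypotheses form an ordered list in which the first matching rule decides the class, and it uses training accuracy as a stand-in for "concept coverage"; both choices, and all names below, are assumptions for illustration rather than the authors' implementation.

```python
WILDCARD = "?"

def covers(conditions, instance):
    """True if every non-wildcard condition equals the instance's attribute value."""
    return all(c == WILDCARD or c == x for c, x in zip(conditions, instance))

def classify(rules, instance):
    """The first matching rule in the ordered list decides; None if nothing matches."""
    for conditions, label in rules:
        if covers(conditions, instance):
            return label
    return None

def accuracy(rules, data):
    """Fraction of (instance, label) pairs classified correctly by the rule list."""
    correct = sum(1 for instance, label in data if classify(rules, instance) == label)
    return correct / len(data)

def assemble(candidates, data):
    """Try each candidate in every position; keep it where it helps most, else drop it."""
    accepted = []
    for candidate in candidates:
        best_position, best_score = None, accuracy(accepted, data)
        for position in range(len(accepted) + 1):
            trial = accepted[:position] + [candidate] + accepted[position:]
            score = accuracy(trial, data)
            if score > best_score:
                best_position, best_score = position, score
        if best_position is not None:
            accepted.insert(best_position, candidate)
    return accepted

# Hypothetical two-attribute data: the class is 1 exactly when the first attribute is "1".
data = [(("1", "x"), 1), (("2", "x"), 0), (("1", "y"), 1)]
catch_all = (("?", "?"), 0)
first_is_one = (("1", "?"), 1)
rules = assemble([catch_all, first_is_one], data)
```

Here the specific rule ends up in front of the catch-all rule, because that is the only position in which adding it improves the score.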


4  Experimental  Results  on  UCI  Databases 


IVSA is tested on some machine learning databases [10]. To demonstrate the
consistency of IVSA, a ten-fold cross validation test is used. The cross validation
test is defined as follows:

Definition 1. Let $I$ be the set of positive and negative instances given,
$i$ be the index for the 10 ten-fold tests, and $j$ be the index for test instances,

then $Test_i = \{x_j \mid x_j \in I, \ldots\}$ and

$Train_i = \{I - Test_i\}$.

That is, for each fold of the test, 90% of the instances are used to train the
system, and the rules learned from these instances are then tested on the
remaining 10% of unseen instances.
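
A ten-fold split of this kind can be sketched as follows. The sketch assumes that the folds are contiguous blocks taken in order and that any remainder is spread over the first folds; how the instances were actually ordered or assigned to folds is not stated here, and the function name is hypothetical.

```python
def ten_fold_splits(instances, folds=10):
    """Yield (train, test) pairs so that each instance is tested exactly once."""
    base, extra = divmod(len(instances), folds)
    start = 0
    for i in range(folds):
        size = base + (1 if i < extra else 0)  # the first `extra` folds get one more
        test = instances[start:start + size]
        train = instances[:start] + instances[start + size:]
        yield train, test
        start += size

instances = list(range(8124))  # stand-in ids for a data set with 8,124 instances
sizes = [(len(train), len(test)) for train, test in ten_fold_splits(instances)]
```

With 8,124 instances this produces four folds with 813 test instances and six folds with 812, so every training set holds either 7,311 or 7,312 instances.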


4.1  Learning  the  Mushroom  Database 

The mushroom database [7] has a total of 8,124 entries (tuples or instances).
Each tuple has 22 feature attributes and one decision attribute. The 22 feature
attributes have 2-5 values and the decision attribute has two values (or classes):
‘p’ (poisonous) or ‘e’ (edible). Because the mushroom database is noise-free, any
machine learning program should be able to learn it accurately. For example,
STAGGER “asymptoted to 95% classification accuracy after reviewing 1,000
instances” [8]; HILLARY learned 1,000 instances and reported an average
accuracy of about 90% over ten runs [5]; a back propagation network developed in
[2] generated ‘crisp logical rules’ that give a correct classification rate of 99.41%;
and variant decision tree methods used in [11] achieved 100% accuracy in a ten-fold
cross validation test [11]. With IVSA, the predictive accuracy on the mushroom
database, shown in Table 1, has reached 100% with 9 rules.
Table 1. Ten-fold Tests on Mushroom Data (CPU: MIPS R4400)

Run      Number of Instances    Accuracies             Number     CPU Time
Number   90%        10%         Training   Testing     of Rules   (h/m/s)
1        7,311      813         100.00%    100.00%     9          01/42/14
2        7,311      813         100.00%    100.00%     9          02/09/42
3        7,311      813         100.00%    100.00%     9          01/45/41
4        7,311      813         100.00%    100.00%     9          01/53/12
5        7,312      812         100.00%    100.00%     9          01/40/58
6        7,312      812         100.00%    100.00%     9          02/30/08
7        7,312      812         100.00%    100.00%     9          01/46/51
8        7,312      812         100.00%    100.00%     9          01/59/00
9        7,312      812         100.00%    100.00%     8          01/46/40
10       7,312      812         100.00%    100.00%     9          01/56/16
Ave.     7,312      812         100.00%    100.00%     9          01/55/04
S.D.     0.49       0.49        0.00       0.00        0.30       859.94

4.2  Learning  the  Monk’s  Databases 

The Monk’s Databases contain three sets: Monk-1, Monk-2, and Monk-3. Each
of the three sets is originally partitioned into training and testing sets [10] [9].
IVSA is trained and tested on Monk-1, Monk-2, and Monk-3. Table 2 shows
that the 5, 61, and 12 rules learned from the Monk-1, Monk-2, and
Monk-3 databases give 100%, 81.02%, and 96.30% classification accuracies on
three sets of 432 previously unseen instances.

Table 2. Tests on Monk’s Databases (CPU: 296 MHz SUNW, UltraSPARC-II)

Data      Instances             Accuracy              # of     CPU Time
Base      Training   Testing    Training   Testing    Rules    (seconds)
Monk-1    124        432        100.00%    100%       5        3
Monk-2    169        432        100.00%    81.02%     61       38
Monk-3    122        432        100.00%    96.30%     12       5

Rules learned from Monk-1, (2 2 ? ? ? ? 1), (3 3 ? ? ? ? 1), (1 1 ? ? ? ? 1),
(? ? ? ? 1 ? 1), (? ? ? ? ? ? 0), show exactly the desired concept description
with the minimum number of rules allowed by the concept language, which can be
rewritten as: (head_shape = body_shape) ∨ (jacket = red) → monk. For the
Monk-2 database, 61 rules are learned, which is relatively large compared with the
other two sets (Monk-1 and Monk-3) due to a highly disjunctive (or irregular)
concept. However, it can be improved with more statistical analysis or some
improved instance space (or representation space) shown in [1] (the predictive
accuracy can be as high as 100% [Bloedorn et al., 1996, p. 109]), although this
method is highly specific to the Monk-2 database. Twelve rules are learned
from the Monk-3 database with 96.3% classification accuracy despite 5% noise
added to the Monk-3 training instances: (1 1 1 1 3 1 0), (1 2 1 2 3 1 0),
(2 2 1 2 2 1 0), (2 2 1 3 3 1 0), (2 2 1 3 3 2 0), (2 3 1 1 3 1 1), (3 3 1 1 3 2 1),
(3 3 1 1 4 1 1), (? ? ? ? 4 ? 0), (? 1 ? ? ? ? 1), (? 2 ? ? ? ? 1), (? ? ? ? ? ? 0).

4.3  Learning  English  Pronunciation  Databases 

IVSA has been applied to learn English pronunciation rules [13]. The task is
to provide a set of rules that transform input English words into sound symbols
using four steps: (1) decompose words into graphemes, (2) form syllables
from graphemes, (3) mark stress on syllables, and (4) transform them into a
sequence of sound symbols. Learning and testing results are shown in Table 3.


Table 3. Learning and Testing Results for Individual Steps

        Learning            Accuracy           Testing           Accuracy           # of
Step    Inst.     Words     Inst.    Words     Inst.    Words    Inst.    Words     Rules
(1)     118,236   17,951    99.58%   99.19%    13,050   1,995    98.18%   94.89%    1,030
(2)     56,325    23,684    97.23%   96.34%    6,241    2,656    96.36%   95.41%    248
(3)     56,325    23,684    78.30%   72.26%    6,241    2,656    77.95%   72.78%    2,080
(4)     118,236   17,951    98.14%   95.31%    16,418   2,656    96.93%   92.23%    1,971

5  Conclusions 

We have presented a new concept learning method, IVSA, its approach, and
test results. Our analysis of previous research shows that an empty version
space signals a new feature of the same target concept presented by a particular
instance. The hypotheses generated by previous version spaces belong to one
region of the target concept, while the current hypotheses generated by a new
version space belong to another region of the same concept. IVSA takes
advantage of an empty version space, using it to divide the regions of a concept,
and correctly handles noisy training instances.

A  concept  description  can  be  divided  into  regions,  and  each  region  can  be 
represented  by  a  subset  of  training  instances.  These  subsets  can  be  collected 
according  to  the  statistical  analysis  on  each  attribute  value  provided  by  the 
Example  Analyser.  The  technique  of  re-arranging  the  order  of  training  instances 
according  to  the  importance  of  a  particular  attribute  value  provides  a  practical 
method  to  overcome  order  bias  dependency  of  the  training  instances. 
The demonstration on learning noisy training instances shows that IVSA has
strong immunity to noisy data and has the ability to learn disjunctive concepts.
The preliminary experimental results show that rules learned by IVSA obtain
high accuracy when applied to previously unseen instances. In the future, we
will intensively test IVSA with additional databases and improve the Example
Analyser to obtain higher learning speed and smaller numbers of rules.
Abstract. We describe statistical and empirical rule quality formulas
and present an empirical comparison of them on standard machine learning
datasets. From the experimental results, a set of formula-behavior
rules is generated which shows relationships between a formula’s
performance and dataset characteristics. These formula-behavior rules are
combined into formula-selection rules which can be used in a rule
induction system to select a rule quality formula before rule induction.

1  Introduction 

A  rule  induction  system  generates  decision  rules  from  a  set  of  data.  The  decision 
rules  determine  the  performance  of  a  classifier  that  exploits  the  rules  to  classify 
unseen  objects.  It  is  thus  important  for  a  rule  induction  system  to  generate  deci¬ 
sion  rules  with  high  predictability  or  reliability.  These  properties  are  commonly 
measured  by  a  function  called  rule  quality.  A  rule  quality  measure  is  needed  in 
both  rule  induction  and  classification.  A  rule  induction  process  is  usually  consid¬ 
ered  as  a  search  over  a  hypothesis  space  of  possible  rules  for  a  decision  rule  that 
satisfies some criterion. In the rule induction process that employs
general-to-specific search, a rule quality measure can be used as a search heuristic to select
attribute-value pairs in the rule specialization process; and/or it can be employed
as a significance measure to stop further specialization. The main reason to focus
special  attention  on  the  stopping  criterion  can  be  found  in  the  studies  on  small 
disjunct  problems  [9].  The  studies  indicated  that  small  disjuncts,  which  cover  a 
small  number  of  training  examples,  are  much  more  error  prone  than  large  dis¬ 
juncts.  To  prevent  small  disjuncts,  a  stopping  criterion  based  on  rule  consistency 
(i.e.,  the  rule  is  consistent  with  the  training  examples)  is  not  suggested  for  use 
in  rule  induction.  Other  criteria,  such  as  the  G2  likelihood  ratio  statistic  as  used 
in  CN2  [7]  and  the  degree  of  logical  sufficiency  as  used  in  HYDRA  [1],  have  been 
proposed  to  “pre-prune”  a  rule  to  avoid  overspecialization.  Some  rule  induction 
systems,  such  as  C4.5  [12]  and  ELEM2  [2],  use  an  alternative  strategy  to  prevent 
the  small  disjunct  problem.  In  these  systems,  the  rule  specialization  process  is 
allowed  to  run  to  completion  (i.e.,  it  forms  a  rule  that  is  consistent  with  the 
training  data  or  as  consistent  as  possible)  and  “post-prunes”  overfitted  rules  by 
removing  components  that  are  deemed  unreliable.  Similar  to  pre-pruning,  a  cri¬ 
terion  is  needed  in  post-pruning  to  determine  when  to  stop  this  generalization 
process.  A  rule  quality  measure  is  also  needed  in  classification.  It  is  possible  that 
an  unseen  example  satisfies  multiple  decision  rules  that  indicate  different  classes. 
In this situation, some conflict resolution scheme must be applied to assign the
unseen object to the most appropriate class. It is therefore useful for each rule
to be associated with a numerical factor representing its classification power,
its reliability, etc. We survey and evaluate statistical and empirical rule quality
measures, some of which have been discussed by Bruha [5]. In our evaluation,
ELEM2 [2] is used as the basic learning and classification algorithm. We
report the experimental results from using these formulas in ELEM2 and compare
the  results  by  indicating  the  significance  level  of  the  difference  between  each 
pair  of  the  formulas.  In  addition,  the  relationship  between  the  performance  of  a 
formula  and  a  dataset  is  obtained  by  automatically  generating  formula-behavior 
rules  from  the  experimental  results.  The  formula-behavior  rules  are  further  com¬ 
bined  into  formula-selection  rules  which  can  be  employed  by  ELEM2  to  select 
a  rule  quality  formula  before  rule  induction.  We  report  the  experimental  results 
showing  the  effects  of  formula-selection  on  ELEM2’s  predictive  performance. 

2  Rule  Quality  Measures 

Many rule quality measures are derived by analysing the relationship between a
decision rule R and a class C. The relationship can be depicted by a 2 × 2
contingency table [5], which consists of a cross-tabulation of categories of observations
with the frequency for each cross-classification shown:

                     Class C           Not class C
Covered by rule R    $n_{rc}$          $n_{r\bar{c}}$          $n_r$
Not covered by R     $n_{\bar{r}c}$    $n_{\bar{r}\bar{c}}$    $n_{\bar{r}}$
                     $n_c$             $n_{\bar{c}}$           $N$

where $n_{rc}$ is the number of training examples covered by rule R and belonging to
class C; $n_{r\bar{c}}$ is the number of training examples covered by R but not belonging
to C, etc.; $N$ is the total number of training examples; $n_r$, $n_{\bar{r}}$, $n_c$ and $n_{\bar{c}}$ are
marginal totals, e.g., $n_r = n_{rc} + n_{r\bar{c}}$, which is the number of examples covered
by R. The contingency table can also be presented using relative rather than
absolute frequencies as follows:

                     Class C           Not class C
Covered by rule R    $f_{rc}$          $f_{r\bar{c}}$          $f_r$
Not covered by R     $f_{\bar{r}c}$    $f_{\bar{r}\bar{c}}$    $f_{\bar{r}}$
                     $f_c$             $f_{\bar{c}}$           1

where $f_{rc} = n_{rc}/N$, $f_{r\bar{c}} = n_{r\bar{c}}/N$, and so on.

2.1  Measures  of  Association 

A measure of association indicates a relationship between the classification for
the columns and the classification for the rows in the 2 × 2 contingency table.
Pearson $\chi^2$ Statistic assumes contingency table cell frequencies are
proportional to the marginal totals if column and row classifications are independent,
and is given by

$$\chi^2 = \sum \frac{(n_o - n_e)^2}{n_e}$$

where $n_o$ is the observed absolute frequency of examples in a cell, and $n_e$ is the
expected absolute frequency of examples for the cell. A computational formula
for $\chi^2$ can be obtained using only the values in the contingency table with
absolute frequencies [6]:

$$\chi^2 = \frac{N (n_{rc} n_{\bar{r}\bar{c}} - n_{r\bar{c}} n_{\bar{r}c})^2}{n_r n_{\bar{r}} n_c n_{\bar{c}}}$$

This value measures whether the classification of examples by rule R and the one
by class C are related. The lower the $\chi^2$ value, the more likely the correlation
between R and C is due to chance.
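
As a quick numerical check, the cell-by-cell definition and the closed-form expression for a 2 × 2 table can be compared directly. This is a minimal sketch; the function names, the argument order (covered and in C, covered and not in C, not covered and in C, not covered and not in C) and the example counts are assumptions chosen for illustration.

```python
def chi_square(n_rc, n_rnc, n_nrc, n_nrnc):
    """Sum of (observed - expected)^2 / expected over the four cells."""
    n = n_rc + n_rnc + n_nrc + n_nrnc
    n_r, n_nr = n_rc + n_rnc, n_nrc + n_nrnc   # row totals
    n_c, n_nc = n_rc + n_nrc, n_rnc + n_nrnc   # column totals
    observed = [n_rc, n_rnc, n_nrc, n_nrnc]
    expected = [n_r * n_c / n, n_r * n_nc / n, n_nr * n_c / n, n_nr * n_nc / n]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def chi_square_shortcut(n_rc, n_rnc, n_nrc, n_nrnc):
    """Closed form for a 2 x 2 table using only cell counts and marginal totals."""
    n = n_rc + n_rnc + n_nrc + n_nrnc
    n_r, n_nr = n_rc + n_rnc, n_nrc + n_nrnc
    n_c, n_nc = n_rc + n_nrc, n_rnc + n_nrnc
    return n * (n_rc * n_nrnc - n_rnc * n_nrc) ** 2 / (n_r * n_nr * n_c * n_nc)

# Hypothetical rule: covers 40 of 100 examples, 30 of them from class C (50 in total).
by_cells = chi_square(30, 10, 20, 40)
by_shortcut = chi_square_shortcut(30, 10, 20, 40)
```

Both functions divide by marginal totals, so a table with an empty row or column needs separate handling.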
G2 Likelihood Ratio Statistic measures the distance between the observed
frequency distribution of examples among classes satisfying rule R and the
expected frequency distribution of the same number of examples where rule R
selects examples randomly. The value of this statistic can be computed as

$$G2 = 2\left(\frac{n_{rc}}{n_r}\log_e\frac{n_{rc}N}{n_r n_c} + \frac{n_{r\bar{c}}}{n_r}\log_e\frac{n_{r\bar{c}}N}{n_r n_{\bar{c}}}\right).$$

The lower the G2 value, the more likely the apparent association between the
two distributions is due to chance.

2.2  Measures  of  Agreement 

A measure of agreement concerns the main diagonal contingency table cells.
Cohen’s Formula Cohen [8] suggests comparing the actual agreement on the
main diagonal ($f_{rc} + f_{\bar{r}\bar{c}}$) with the chance agreement ($f_r f_c + f_{\bar{r}} f_{\bar{c}}$) by using the
normalized difference of the two:

$$Q_{Cohen} = \frac{f_{rc} + f_{\bar{r}\bar{c}} - (f_r f_c + f_{\bar{r}} f_{\bar{c}})}{1 - (f_r f_c + f_{\bar{r}} f_{\bar{c}})}$$

When both elements $f_{rc}$ and $f_{\bar{r}\bar{c}}$ are reasonably large, Cohen’s statistic gives a
higher value which indicates the agreement on the main diagonal.

Coleman’s Formula Coleman [3, 5] defines a measure of agreement between
the first column and any particular row in the contingency table. Bruha [5]
modifies Coleman’s measure to define rule quality, which actually corresponds
to the agreement on the upper-left element of the contingency table. The formula
normalizes the difference between actual and chance agreement:

$$Q_{Coleman} = \frac{f_{rc} - f_r f_c}{f_r - f_r f_c}$$
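
The two agreement measures can be computed from the four cell counts. The sketch below assumes the usual chance-corrected forms: Cohen's measure divides the difference between actual and chance agreement on the main diagonal by one minus the chance agreement, and the Coleman-style measure does the same for the upper-left cell relative to the rule's relative frequency. Function names and example counts are hypothetical.

```python
def relative_frequencies(n_rc, n_rnc, n_nrc, n_nrnc):
    """Convert the four cell counts into relative frequencies."""
    n = n_rc + n_rnc + n_nrc + n_nrnc
    return n_rc / n, n_rnc / n, n_nrc / n, n_nrnc / n

def q_cohen(n_rc, n_rnc, n_nrc, n_nrnc):
    f_rc, f_rnc, f_nrc, f_nrnc = relative_frequencies(n_rc, n_rnc, n_nrc, n_nrnc)
    f_r, f_c = f_rc + f_rnc, f_rc + f_nrc            # marginals for R and C
    chance = f_r * f_c + (1 - f_r) * (1 - f_c)       # chance agreement on the diagonal
    return (f_rc + f_nrnc - chance) / (1 - chance)

def q_coleman(n_rc, n_rnc, n_nrc, n_nrnc):
    f_rc, f_rnc, f_nrc, _ = relative_frequencies(n_rc, n_rnc, n_nrc, n_nrnc)
    f_r, f_c = f_rc + f_rnc, f_rc + f_nrc
    return (f_rc - f_r * f_c) / (f_r - f_r * f_c)    # agreement on the upper-left cell

# Hypothetical rule: covers 40 of 100 examples, 30 of them from class C (50 in total).
cohen = q_cohen(30, 10, 20, 40)
coleman = q_coleman(30, 10, 20, 40)
```

For these counts the actual diagonal agreement is 0.7 against a chance agreement of 0.5, giving 0.4 for Cohen's measure and 0.5 for the Coleman-style measure.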

2.3  Measure  of  Information 

Given class C, the amount of information necessary to correctly classify an
instance into class C whose prior probability is $P(C)$ is defined as $-\log_2 P(C)$.
Given rule R, the amount of information we need to correctly classify an
instance into class C is $-\log_2 P(C|R)$, where $P(C|R)$ is the posterior
probability of C given R. Thus, the amount of information obtained by rule R is
$-\log_2 P(C) + \log_2 P(C|R)$. This value is called information score [10]. It
measures the amount of information R contributes and can be expressed as

$$Q_{IS} = -\log_2\frac{n_c}{N} + \log_2\frac{n_{rc}}{n_r}$$
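
The information score follows directly from frequency estimates of the prior and posterior probabilities. A minimal sketch, assuming $P(C)$ is estimated by the fraction of class C examples and $P(C|R)$ by the fraction of class C examples among those covered by R; the function name and counts are hypothetical.

```python
import math

def information_score(n_rc, n_r, n_c, n):
    """Information gained about class C by knowing that rule R fires, in bits."""
    prior = n_c / n          # estimate of P(C)
    posterior = n_rc / n_r   # estimate of P(C|R)
    return -math.log2(prior) + math.log2(posterior)

# Hypothetical rule: covers 40 of 100 examples, 30 of them from class C (50 in total).
score = information_score(n_rc=30, n_r=40, n_c=50, n=100)
```

The score is positive when the rule raises the probability of C above its prior (here from 0.5 to 0.75) and negative when it lowers it.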

2.4 Measure of Logical Sufficiency

The logical sufficiency measure is a standard likelihood ratio statistic, which has
been applied to measure rule quality [1]. Given a rule R and a class C, the degree
of logical sufficiency of R with respect to C is defined by

$$Q_{LS} = \frac{P(R|C)}{P(R|\bar{C})}$$

where P denotes probability. A rule for which $Q_{LS}$ is large means that the
observation of R is encouraging for the class C; in the extreme case of $Q_{LS}$
approaching infinity, R is sufficient to establish C in a strict logical sense. On the
other hand, if $Q_{LS}$ is much less than unity, the observation of R is discouraging
for C. Using frequencies to estimate the probabilities, we have
$Q_{LS} = \frac{n_{rc}/n_c}{n_{r\bar{c}}/n_{\bar{c}}}$.

2.5 Measure of Discrimination

Another statistical rule quality formula is the measure of discrimination, which
is applied in ELEM2 [2]. The formula was inspired by a query term weighting
formula used in probability-based information retrieval. The formula
measures the extent to which a query term can discriminate relevant and
non-relevant documents [13]. If we consider a rule R as a query term in an IR setting,
positive examples of class C as relevant documents, and negative examples as
non-relevant documents, then the following formula can be used to measure the
extent to which rule R discriminates positive and negative examples of class C:

$$Q_{MD} = \log\frac{P(R|C)(1 - P(R|\bar{C}))}{P(R|\bar{C})(1 - P(R|C))}$$

where P denotes probability. The formula represents the ratio between the rule’s
positive and negative odds and can be estimated as
$Q_{MD} = \log\frac{n_{rc} n_{\bar{r}\bar{c}}}{n_{r\bar{c}} n_{\bar{r}c}}$.
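
Both the logical sufficiency measure and the measure of discrimination are ratios built from $P(R|C)$ and $P(R|\bar{C})$, so they can be sketched together. The sketch estimates the probabilities by frequencies and assumes a natural logarithm, since the base is not stated; names and counts are hypothetical, and a zero cell count would need smoothing or special handling.

```python
import math

def logical_sufficiency(n_rc, n_rnc, n_nrc, n_nrnc):
    """Likelihood ratio P(R|C) / P(R|not C), estimated from cell counts."""
    p_r_given_c = n_rc / (n_rc + n_nrc)
    p_r_given_not_c = n_rnc / (n_rnc + n_nrnc)
    return p_r_given_c / p_r_given_not_c

def measure_of_discrimination(n_rc, n_rnc, n_nrc, n_nrnc):
    """Log of the ratio between the rule's positive odds and negative odds."""
    p = n_rc / (n_rc + n_nrc)        # P(R|C)
    q = n_rnc / (n_rnc + n_nrnc)     # P(R|not C)
    return math.log((p * (1 - q)) / (q * (1 - p)))

# Hypothetical rule: covers 40 of 100 examples, 30 of them from class C (50 in total).
ls = logical_sufficiency(30, 10, 20, 40)
md = measure_of_discrimination(30, 10, 20, 40)
md_from_counts = math.log((30 * 40) / (10 * 20))  # same value from the cell counts
```

The probability form and the count form of the discrimination measure agree because the class totals cancel in the odds ratio.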

2.6  Empirical  Formulas 

Some rule quality formulas are not based on statistical or information theories,
but on intuitive logic. Bruha [5] refers to these as empirical formulas. We
describe two empirical formulas that combine two characteristics of a rule:
consistency and coverage. Using the elements of the contingency table, the consistency
of a rule R can be defined as $cons(R) = \frac{n_{rc}}{n_r}$ and its coverage as $cover(R) = \frac{n_{rc}}{n_c}$.
Weighted Sum of Consistency and Coverage Michalski [11] proposes to
use the weighted sum of consistency and coverage as a measure of rule quality:

$$Q_{WS} = w_1 \times cons(R) + w_2 \times cover(R)$$

where $w_1$ and $w_2$ are user-defined weights with their values belonging to (0, 1)
and summing to 1. This formula is applied in the incremental learning system
YAILS [14]. The weights in YAILS are specified automatically as: $w_1 = 0.5 +
\frac{1}{4}cons(R)$ and $w_2 = 0.5 - \frac{1}{4}cons(R)$. These weights are dependent on consistency.
The larger the consistency, the more influence consistency has on rule quality.
Product of Consistency and Coverage Brazdil and Torgo [4] propose to
use a product of consistency and coverage as rule quality:

$$Q_{Prod} = cons(R) \times f(cover(R))$$

where $f$ is an increasing function. The authors conducted a large number of
experiments and chose to use the following form of $f$: $f(x) = e^{x-1}$. This setting
of $f$ makes the difference in coverage have a smaller influence on rule quality, which
results in the rule quality formula preferring consistency.
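
The two empirical formulas are simple enough to evaluate by hand, and a short sketch makes the weighting explicit. It assumes consistency is the fraction of covered examples that belong to C, coverage is the fraction of class C examples that are covered, the YAILS-style weights are 0.5 plus or minus a quarter of the consistency, and $f(x) = e^{x-1}$; the function names and counts are hypothetical.

```python
import math

def consistency(n_rc, n_r):
    """Fraction of the examples covered by the rule that belong to class C."""
    return n_rc / n_r

def coverage(n_rc, n_c):
    """Fraction of the class C examples that the rule covers."""
    return n_rc / n_c

def q_ws(cons, cover):
    """Weighted sum with weights that depend on consistency and sum to 1."""
    w1 = 0.5 + 0.25 * cons
    w2 = 0.5 - 0.25 * cons
    return w1 * cons + w2 * cover

def q_prod(cons, cover):
    """Consistency times an increasing function of coverage, f(x) = e^(x - 1)."""
    return cons * math.exp(cover - 1)

# Hypothetical rule: covers 40 examples, 30 of them from class C (50 in total).
cons = consistency(30, 40)
cover = coverage(30, 50)
ws = q_ws(cons, cover)
prod = q_prod(cons, cover)
```

Because $e^{x-1}$ only ranges from about 0.37 to 1 on [0, 1], coverage can change the product by less than a factor of three, while consistency scales it linearly.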

3  Experiments  with  Rule  Quality  Measures 

3.1  The  Learning  System 

ELEM2  uses  a  sequential  covering  learning  strategy;  it  reduces  the  problem  of 
learning  a  disjunctive  set  of  rules  to  a  sequence  of  learning  a  single  conjunctive 
rule  that  covers  a  subset  of  positive  examples.  Learning  a  conjunctive  rule  begins 
by  considering  the  most  general  rule  precondition,  then  greedily  searches  for  an 
attribute-value pair that is most relevant to class C according to the following
function: $SIG_C(av) = P(av)(P(C|av) - P(C))$, where av is an attribute-value
pair  and  P  denotes  probability.  The  selected  attribute-value  pair  is  then  added 
to  the  rule  precondition  as  a  conjunct.  The  process  is  repeated  until  the  rule 
is  as  consistent  with  the  training  data  as  possible.  Since  a  “consistent”  rule 
may  be  a  small  disjunct  that  overfits  the  training  data,  ELEM2  “post-prunes” 
the  rule  after  the  initial  search  for  the  rule  is  complete.  To  post-prune  a  rule, 
ELEM2  first  computes  a  rule  quality  value  according  to  the  formula  of  measure 
of  discrimination  Qmd  (Section  2.5).  It  then  checks  each  attribute- value  pair  in 
the  rule  in  the  reverse  order  in  which  they  were  selected  to  see  if  removal  of  a 
pair  will  decrease  the  rule  quality  value.  If  not,  the  pair  is  removed. 
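
The relevance function and the post-pruning pass can be sketched as follows. The sketch estimates probabilities by counting, and it compares each trial rule against the current (already pruned) rule; whether ELEM2 compares against the current or the original rule is an assumption here, as are the data layout, the function names and the toy quality function.

```python
def sig(examples, attribute, value, target_class):
    """SIG_C(av) = P(av) * (P(C|av) - P(C)), with probabilities estimated by counts."""
    n = len(examples)
    with_av = [label for attrs, label in examples if attrs.get(attribute) == value]
    p_av = len(with_av) / n
    p_c = sum(1 for _, label in examples if label == target_class) / n
    if not with_av:
        return 0.0
    p_c_given_av = sum(1 for label in with_av if label == target_class) / len(with_av)
    return p_av * (p_c_given_av - p_c)

def post_prune(conditions, quality):
    """Drop conditions, last selected first, whenever rule quality does not decrease."""
    pruned = list(conditions)
    for condition in reversed(conditions):
        trial = [c for c in pruned if c != condition]
        if quality(trial) >= quality(pruned):
            pruned = trial
    return pruned

# Hypothetical examples: (attribute dictionary, class label).
examples = [({"x": 1}, "C"), ({"x": 1}, "C"), ({"x": 0}, "C"), ({"x": 0}, "D")]
relevance = sig(examples, "x", 1, "C")

# Toy quality function: only the pair ("a", 1) matters for rule quality.
toy_quality = lambda conds: 1.0 if ("a", 1) in conds else 0.0
pruned_rule = post_prune([("a", 1), ("b", 2), ("c", 3)], toy_quality)
```

In practice the `quality` argument would be one of the rule quality formulas of Section 2 evaluated on the training data.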

After rules are induced for all classes, the rules can be used to classify
new examples. The classification procedure in ELEM2 considers three possible
cases: (1) Single match. The new example satisfies one or more rules of the
same class. In this case, the example is classified to that class. (2) Multiple
match. The new example satisfies more than one rule, and the rules indicate
different classes. In this case, ELEM2 computes a decision score for each of the
matched classes as: $DS(C) = \sum_{i=1}^{k} Q(r_i)$, where $r_i$ is a matched rule that
indicates class C, $k$ is the number of this kind of rules, and $Q(r_i)$ is the rule
quality of $r_i$. The new example is then classified into the class with the highest
decision score. (3) No match. The new example is not covered by any rule. Partial
matching is conducted. If the partially-matched rules do not agree on classes, a
partial matching score between new example $e$ and a partially-matched rule $r_i$
with $n$ attribute-value pairs, $m$ of which match the corresponding attributes of
$e$, is computed as $PMS(r_i) = \frac{m}{n} \times Q(r_i)$. A decision score for a class C is
computed as $DS(C) = \sum_{i=1}^{k} PMS(r_i)$, where $k$ is the number of partially-matched
rules indicating class C. The new example is classified into the class with the
highest decision score.
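
The conflict resolution step can be sketched as follows, assuming that a class's decision score is the sum of the qualities of its matched rules and that a partial matching score scales a rule's quality by the fraction of its attribute-value pairs that match. The function names and the quality values are hypothetical.

```python
from collections import defaultdict

def decision_scores(matched_rules):
    """Sum the scores per class; matched_rules is a list of (class, score) pairs."""
    scores = defaultdict(float)
    for rule_class, score in matched_rules:
        scores[rule_class] += score
    return dict(scores)

def partial_matching_score(matched_pairs, total_pairs, quality):
    """Scale a rule's quality by the fraction of its conditions that match."""
    return matched_pairs / total_pairs * quality

def best_class(scores):
    """Class with the highest decision score."""
    return max(scores, key=scores.get)

# Multiple match: two rules for class A and one rule for class B fire.
full = decision_scores([("A", 1.5), ("B", 1.75), ("A", 0.5)])

# No match: rules match only partially (2 of 4 and 1 of 4 conditions).
partial = decision_scores([("A", partial_matching_score(2, 4, 2.0)),
                           ("B", partial_matching_score(1, 4, 2.0))])
```

Note that class B has the single best rule in the multiple-match case, yet class A wins because the scores of its two rules add up.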

3.2  Experimental  Design 

We evaluate the rule quality formulas described in Section 2 by determining
how rule quality formulas affect the predictive performance of ELEM2. In our
experiments, we run versions of ELEM2, each of which uses a different rule
quality formula. The formulas $Q_{MD}$, $Q_{Cohen}$, $Q_{Coleman}$, $Q_{IS}$, $Q_{LS}$, $Q_{WS}$, and $Q_{Prod}$
are used exactly as described in Section 2. The $\chi^2$ statistic is used in two ways:
(1) $Q_{\chi^2_{.05}}$: In post-pruning, the removal of an attribute-value pair depends on
whether the rule quality value after removing an attribute-value pair is greater
than $\chi^2_{.05}$, the tabular $\chi^2$ value for the significance level of 0.05 with one
degree of freedom. If the calculated value is greater than $\chi^2_{.05}$, then remove the
attribute-value pair; otherwise check other pairs or stop post-pruning if all pairs
have been checked. (2) $Q_{\chi^2_{.05+}}$: In post-pruning, an attribute-value pair is
removed if and only if the rule quality value after removing the pair is greater
than $\chi^2_{.05}$ and no less than the rule quality value before removing the pair. The
G2 statistic, denoted as $Q_{G2_{.05+}}$, is used in the same way as $Q_{\chi^2_{.05+}}$.

Our experiments are conducted using 22 benchmark datasets from the UCI
Repository of Machine Learning Databases. The datasets represent a mixture of
characteristics shown in Table 1. ELEM2 removes all the examples containing
missing values before rule induction. For datasets with missing values (such as
“crx”), the number of examples shown in Table 1 is the number after removal.


                     Number of                         Class
Datasets             classes  attributes  examples     Distribution  Domain
1  abalone           3        8           4177         Even          Predicting the age of abalone from physical measurements
2  australia         2        14          690          Even          Credit card application approval
3  balance-scale     3        4           625          Uneven        Balance scale classification
4  breast-cancer     2        9           683          Uneven        Medical diagnosis
5  bupa              2        6           345          Uneven        Liver disorder database
6  crx               2        15          653          Uneven        Credit card applications
7  diabetes          2        8           768          Uneven        Medical diagnosis
8  ecoli             8        7           336          Uneven        Predicting protein localization sites
9  german            2        20          1000         Uneven        Credit database to classify people as good or bad credit risks
10 glass             6        9           214          Uneven        Glass identification for criminological investigation
11 heart             2        13          270          Uneven        Heart disease diagnosis
12 ionosphere        2        33          351          Uneven        Classification of radar returns
13 iris              3        4           150          Even          Iris plant classification
14 lenses            3        4           24           Uneven        Database for fitting contact lenses
15 optdigits         10       64          3823         Even          Optical recognition of handwritten digits
16 pendigits         10       16          7494         Even          Pen-based recognition of handwritten digits
17 post-operative    3        8           87           Uneven        Postoperative Patient Data
18 segment           7        18          2310         Even          Image segmentation
19 tic-tac-toe       2        9           958          Uneven        Tic-Tac-Toe Endgame database
20 wine              3        13          178          Uneven        Wine recognition data
21 yeast             10       8           1484         Uneven        Predicting protein localization sites
22 zoo               7        16          101          Uneven        Animal classification

Table 1. Description of Datasets.


3.3  Results 

On each dataset, we conduct a ten-fold cross-validation of a rule quality measure
using ELEM2. The mean predictive accuracy over the 10 runs on each dataset
for each formula is shown in Figure 1. The average of the
accuracy means for each formula over the 22 datasets is shown in Table 2, where
$Q_{WS}$   $Q_{MD}$   $Q_{LS}$   $Q_{Coleman}$   $Q_{G2_{.05+}}$   $Q_{IS}$   $Q_{Cohen}$   $Q_{\chi^2_{.05}}$

80.66   79.85   79.51   79.05

Table 2. Average of accuracy means for each formula over the datasets.

the  formulas  are  listed  in  decreasing  order  of  average  accuracy  means.  Whether 
a  formula  with  a  higher  average  is  significantly  better  than  a  formula  with  a 
lower  average  is  determined  by  paired  t-tests  using  the  S-Plus  statistics  software. 
The  t-test  results  in  terms  of  p- values  are  reported  in  Table  3.  A  small  p- value 
indicates  that  the  null  hypothesis  (the  difference  between  the  two  formulas  is  due 
to  chance)  should  be  rejected  in  favor  of  the  alternative  at  any  significance  level 
above  the  calculated  value.  The  p- values  that  are  smaller  than  0.05  are  shown 
in  bold-type  to  indicate  that  the  formula  with  higher  average  is  significantly 
better  than  the  formula  with  the  lower  average  at  the  5%  significance  level. 
For example, $Q_{WS}$ is significantly better than $Q_{Coleman}$, $Q_{G2_{.05+}}$, $Q_{IS}$, $Q_{\chi^2_{.05+}}$,


Qws 

Qmd 

Qls 

Qcoleman 

ISO 

Qis 

Qcohen 

Qws 

0.0119 

riWiTiv^i 

riWiTTiTH 

Qmd 

0.1323 

0.0188 

Qls 

- 

0.0539 

■tHiIiHtl 

EfiSiE] 

0.149 

. 

- 

NA 

0.0526 

|i2>2i3 

. 

- 

- 

- 

mifm 

0.6947 

0.5621 

0.4325 

0.3962 

- 

- 

- 

- 

- 

NA 

ESb 

0.7512 

0.6316 

Qis  . 

- 

- 

- 

- 

- 

- 

NA 

0.733 

- 

- 

- 

- 

- 

- 

- 

NA 

0.6067 

- 

- 

- 

- 

- 

- 

- 

- 

NA 

_ _  .QB _ 

- 

- 

- 

- 

- 

- 

- 

- 

- 

Table  3.  Significance  levels  (p- values  from  paired  t-test)  of  improvement. 

$Q_{Cohen}$ and $Q_{MD}$ and $Q_{LS}$ are significantly better than $Q_{G2_{.05+}}$, $Q_{IS}$ and

; and all formulas are significantly better than $Q_{\chi^2_{.05}}$ at the 5% significance

level. Generally speaking, $Q_{WS}$, $Q_{MD}$ and $Q_{LS}$ are comparable even if their
performance does not agree on a particular dataset. $Q_{Coleman}$ and $Q_{Prod}$, and
$Q_{\chi^2_{.05+}}$ and $Q_{Cohen}$ are comparable. $Q_{G2_{.05+}}$ and $Q_{IS}$ are not only comparable,
but also similar on each particular dataset, indicating that they have similar
trends with regard to $n_{rc}$, $n_r$, $n_c$ and $N$ in the contingency table.
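
For readers who want to reproduce this kind of comparison, the paired t statistic behind such p-values can be computed with the standard library alone. This is a minimal sketch with made-up accuracy values; it computes only the test statistic, and converting it to a p-value additionally requires the Student t distribution with n - 1 degrees of freedom, which is what a statistics package supplies.

```python
import math
import statistics

def paired_t_statistic(scores_a, scores_b):
    """t = mean(d) / (stdev(d) / sqrt(n)) for the paired differences d = a - b."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    mean_diff = statistics.mean(diffs)
    std_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1 denominator)
    return mean_diff / (std_diff / math.sqrt(len(diffs)))

# Hypothetical accuracy means of two formulas on the same three datasets.
formula_a = [81.0, 72.0, 93.0]
formula_b = [80.0, 70.0, 90.0]
t_value = paired_t_statistic(formula_a, formula_b)
```

Pairing by dataset matters: the test looks at per-dataset differences, so a formula that is consistently a little better can be significant even when the accuracies themselves vary widely across datasets.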

4 Learning from the Experimental Results

From our results, we posit that, even if the learning performance on some
datasets (such as the breast cancer dataset) is not very sensitive to the rule quality
formula used, the performance greatly depends on the formula on most of the
other datasets. It would be desirable to apply a “right” formula that
gives the best performance among all formulas on a particular dataset. For
example, although formula $Q_{\chi^2_{.05}}$ is not a good formula in general, it performs
better than other formulas on some datasets such as heart and lenses. If we can
find conditions under which each formula leads to good learning performance, we
can select “right formulas” for different datasets and can improve the predictive
performance of the learning system further.
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To  find  out  this  regularity,  we  use  our  learning  system,  i.e.,  ELEM2,  to  learn 
the  formula  selection  rules  from  the  experimental  results  shown  in  the  Ifist  sec¬ 
tion.  The  learning  problem  is  divided  into  (1)  learning  formula-behavior  rules  for 
each  rule  quality  formula  that  describe  the  conditions  under  which  the  formula 
produces  “good”,  “medium”  or  “bad”  results,  and  (2)  combining  the  rules  for 
all  the  formulas  that  describe  the  conditions  under  which  the  formulas  give  the 
“good”  results.  The  resulting  set  of  rules  is  the  formula-selection  rules  that  can 
be  used  by  the  ELEM2  classification  procedure  to  perform  formula  selection. 

4.1  Data  Representation 

To learn formula-behavior rules, we construct training examples from Figure 1
and Table 1. First, on each dataset, we label the relative performance of each
formula as "good", "medium", or "bad". For example, on the abalone dataset, we
say that the formulas whose accuracy mean is above 60% produce "good" results;
the formulas whose accuracy mean is between 56% and 60% produce "medium"
results; and the other formulas give "bad" results. Then, for each formula, we
construct a training dataset in which each training example describes the
characteristics of a dataset and the performance of the formula on that dataset.
Thus, to learn behavior rules for each formula, we have 22 training examples. The
dataset characteristics are described in terms of the number of examples, number
of attributes, number of classes and the class distribution. Samples of training
examples for learning behavior rules of Qis are shown in Table 4.


Number of   Number of    Number of   Class          Performance
Examples    Attributes   Classes     Distribution
4177        8            3           Even           Good
690         14           2           Even           Medium
625         4            3           Uneven         Bad
683         9            2           Uneven         Medium

Table 4. Sample of training examples for learning the behavior of a formula
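
The construction of one training example can be sketched as below. This is only an illustration under stated assumptions: the function and field names are hypothetical, the thresholds (60 and 56) are the ones quoted for the abalone dataset, and the accuracy value passed in is made up rather than taken from the experiments.

```python
def label_performance(accuracy_mean, good_above, medium_from):
    """Map an accuracy mean to "Good", "Medium" or "Bad" for one dataset."""
    if accuracy_mean > good_above:
        return "Good"
    if accuracy_mean >= medium_from:
        return "Medium"
    return "Bad"


def make_example(n_examples, n_attributes, n_classes, class_distribution,
                 accuracy_mean, good_above, medium_from):
    """Build one training example: dataset characteristics plus a label."""
    return {
        "examples": n_examples,
        "attributes": n_attributes,
        "classes": n_classes,
        "class_distribution": class_distribution,
        "performance": label_performance(accuracy_mean, good_above, medium_from),
    }


# Dataset characteristics from the first row of Table 4; the accuracy mean
# 61.5 is a hypothetical value used only to exercise the thresholds.
example = make_example(4177, 8, 3, "Even", 61.5, good_above=60, medium_from=56)
```

Since the thresholds are chosen per dataset, they are passed in as parameters rather than fixed in the function.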

4.2  The  Learning  Results 

ELEM2 with its default rule quality formula (Qmd) is used to learn the "behavior"
rules from the training dataset constructed for each formula. Table 5 shows
samples of generated rules for each formula, where N stands for the number of
examples, NofA is the number of attributes, NofC is the number of classes, and
the column "No. of Support Datasets" gives the number of datasets that
support the corresponding rule. These rules serve two purposes. First, they
summarize the predictive performance of each formula in terms of dataset
characteristics. Second, we build a set of formula-selection rules by combining
all "good" rules, i.e., the rules that predict "good" performance for each formula,
and use them to select a "right" rule quality formula for a (new) dataset. For
formula selection, we can use the ELEM2 classification procedure, which takes
the formula-selection rules and classifies a dataset into the class of datasets
that should use a particular formula.
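
A rough sketch of formula selection with "good" rules is given below. It is not ELEM2's classification procedure: summing the qualities of the matching rules per formula is an assumption made for illustration, and the helper names are hypothetical. The four rules are the legible "Good" rules for Qws and Qmd in Table 5, with the conditions copied as printed; the fallback to Qmd reflects that Qmd is ELEM2's default formula.

```python
# Each entry: (formula, condition on dataset characteristics, rule quality).
GOOD_RULES = [
    ("Qws", lambda d: d["NofA"] < 20 and d["NofC"] == 2, 1.23),
    ("Qws", lambda d: d["N"] < 3823 and d["ClassDistr"] == "Even", 0.77),
    ("Qmd", lambda d: d["N"] > 625 and 8 < d["NofA"] < 18, 1.56),
    ("Qmd", lambda d: d["NofC"] > 8, 1.07),
]


def select_formula(dataset, rules=GOOD_RULES, default="Qmd"):
    """Pick the formula whose matching "good" rules have the highest total quality."""
    scores = {}
    for formula, condition, quality in rules:
        if condition(dataset):
            scores[formula] = scores.get(formula, 0.0) + quality
    if not scores:
        return default  # no rule fires: fall back to the default formula
    return max(scores, key=scores.get)


# Characteristics of the second dataset listed in Table 4.
chosen = select_formula({"N": 690, "NofA": 14, "NofC": 2, "ClassDistr": "Even"})
# A hypothetical dataset that matches none of the four rules.
fallback = select_formula({"N": 100, "NofA": 25, "NofC": 3, "ClassDistr": "Uneven"})
```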

4.3  ELEM2  with  Multiple  Rule  Quality  Formulas 

With  formula-selection  rules,  we  can  apply  ELEM2’s  classification  procedure  to 
select a rule quality formula before using ELEM2 to induce rules from a dataset.

Formula    Condition                                    Decision  Rule     No. of Support
                                                                  Quality  Datasets
Qws        (NofA<20)and(NofC=2)                         Good      1.23     8
           (N<3823)and(ClassDistr=Even)                 Good      0.77     4
Qmd        (N>625)and(8<NofA<18)                        Good      1.56     6
           (NofC>8)                                     Good      1.07     3
Qls        (ClassDistr=Even)                            Good               6
           (N<24)                                       Bad                1
Qcoleman   (N>768)and(NofA>8)                                     1.52     5
           (N>1484)                                               1.34     4
           (351<N<683)and(NofA<9)                                 1.57     2
Qprod      (N<214)and(NofA<13)                                    1.66     5
           (NofA>20)                                              1.05     2
           (351<N<653)                                            1.40     2
QG2.05+    (N>1484)                                                        4
           (NofA>20)                                                       2
           (NofA<7)and(NofC>2)and(ClassDistr=Uneven)                       3
Qis        (N>1484)                                                        4
           (NofA>20)                                                       2
           (NofA<7)and(NofC>2)and(ClassDistr=Uneven)                       3
Qχ².05+    (87<N<178)                                   Good      1.27     3
           (13<NofA<15)                                 Bad       1.57     2
Qcohen     (101<N<1484)and(NofA<8)and(NofC>2)           Good      1.34     4
           (768<N<2310)and(8<NofA<18)                   Bad       1.40     2
Qχ².05     (N<87)                                       Good      1.57     2
           (9<NofA<14)and(NofC<2)                       Good      1.57     2
           (N>87)                                       Bad       1.15     15

Table 5. Formula Behavior Rules


Thus, ELEM2 can use different formulas on different datasets. To evaluate this
strategy, we conduct ten-fold evaluation of ELEM2 on the 22 datasets we used.
The result is shown in Figure 2, in which the average accuracy mean of the
"flexible" ELEM2 (labeled "Combine" in the graph) is compared with the ones
obtained using individual formulas. We also conduct paired t-tests to see how
much the flexible ELEM2 improves over ELEM2 with a single rule quality formula.
The p-values from the t-tests are shown in Table 6. "Combine" improves on Qws,
Qmd and Qls at the 2.5% significance level, and it improves on the other formulas
more significantly, at the 0.5% significance level.

Fig. 2. Average of accuracy means of each formula on the 22 datasets


          Qws   Qmd     Qls     Qcoleman  Qprod  QG2.05+  Qis  Qχ².05+  Qcohen  Qχ².05
Combine         0.0217  0.0228  0.0031

Table 6. Significance levels of the improvement of "Combine" over individual formulas

5  Conclusions

We have described and evaluated various statistical and empirical formulas for
defining rule quality measures. The performance of these formulas varies among
datasets. The empirical formulas, especially Qws, work very well. Among the
statistical formulas, Qmd and Qls work best on the tested datasets and are
comparable with Qws. To determine the regularity of a rule quality formula's
performance in terms of dataset characteristics, we used ELEM2 to induce rules
from a dataset constructed from the experimental results. These rules indicate
the situations in which a formula leads to good, medium or bad
performance. They can also be used to automatically select a rule quality
formula before rule induction. Our experiment showed that this selection can
lead to significant improvement over the rule induction system using a single
formula. Future work includes testing our conclusions on more datasets to obtain
more reliable formula-selection rules.

Abstract. A method for constructing classification (decision) systems
is presented. The use of decision rules derived using rough set methods
as new attributes is considered. Neural networks are applied as a tool
for constructing a classifier over the reconstructed dataset. Possible
benefits of such an approach are briefly presented together with the results of
preliminary experiments.


1  Introduction 

In the process of constructing classification (decision) systems we have several
objectives in mind. Among others, we are concerned with robustness, versatility,
adaptiveness, compactness and intuitive understanding of the produced solution.
Of course it is a tough job to fulfill all the expectations, especially if our system
is based only on the information contained in the data. In this paper we try to
address the issue of compactness and adaptiveness of a classifier. We propose a
method of treating decision rules as a source of new features. Using those rules
we construct a new set of data that is easier to classify. A simple artificial neural
network is used for this purpose.

The classifier constructed in such a way shows, according to preliminary
experiments, some useful features. It is usually smaller and simpler than a rough
set classifier of comparable quality. It is also easier to explain intuitively as it
has fewer components.

The paper begins with an introduction of basic notions. Then some foundational
features of rule based rough set classifiers are presented. The following sections
contain a short description of the proposed solution and the initial experimental
results.

2  Basic  notions 

The structure of the data that is the subject of our study is represented in the
form of an information system [9] or, more precisely, the special case of an
information system called a decision table.

An information system is a pair of the form $\mathbb{A} = (U, A)$ where $U$ is a
universe of objects and $A = (a_1, \ldots, a_m)$ is a set of attributes, i.e. mappings
of the form $a_i : U \to V_{a_i}$, where $V_{a_i}$ is called the value set of the
attribute $a_i$. The decision table is also a pair of the form
$\mathbb{A} = (U, A \cup \{d\})$, where the major feature that distinguishes it from
the information system is the distinguished attribute $d$.
In the case of a decision table the attributes belonging to $A$ are called conditional
attributes or simply conditions, while $d$ is called the decision (sometimes the
decision attribute). We will further assume that the set of decision values is finite,
and by $rank(d)$ we will refer to its cardinality. The $i$-th decision class is the set of
objects $C_i = \{o \in U : d(o) = d_i\}$, where $d_i$ is the $i$-th decision value taken
from the decision value set $V_d = \{d_1, \ldots, d_{rank(d)}\}$.

For any subset of attributes $B \subseteq A$ the indiscernibility relation $IND(B)$
is defined as follows:

$$x \, IND(B) \, y \iff \forall_{a \in B} \; a(x) = a(y) \qquad (1)$$

where $x, y \in U$.

Having the indiscernibility relation we may define the notion of a reduct.
$B \subseteq A$ is a reduct of the information system if $IND(B) = IND(A)$ and
no proper subset of $B$ has this property.
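
The two definitions above can be illustrated with a small sketch. The table layout (a dict from object id to attribute values), the function names and the toy data are all assumptions made for this example; the exhaustive search over attribute subsets is only feasible for tiny tables and is not the reduct calculation method used in the paper.

```python
from itertools import combinations


def ind_classes(table, attrs):
    """Partition object ids into the equivalence classes of IND(attrs)."""
    classes = {}
    for obj_id, row in table.items():
        key = tuple(row[a] for a in attrs)
        classes.setdefault(key, set()).add(obj_id)
    return {frozenset(c) for c in classes.values()}


def is_reduct(table, attrs, all_attrs):
    """True if attrs preserves IND(all_attrs) and no attribute can be dropped."""
    target = ind_classes(table, all_attrs)
    if ind_classes(table, attrs) != target:
        return False
    # Removing attributes can only merge classes, so it is enough to check
    # that dropping any single attribute changes the partition.
    return all(
        ind_classes(table, [a for a in attrs if a != dropped]) != target
        for dropped in attrs
    )


# Toy information system: attributes b and c carry the same information.
table = {
    1: {"a": 0, "b": 0, "c": 0},
    2: {"a": 0, "b": 1, "c": 1},
    3: {"a": 1, "b": 0, "c": 0},
    4: {"a": 1, "b": 1, "c": 1},
}
all_attrs = ["a", "b", "c"]
reducts = [
    set(subset)
    for size in range(1, len(all_attrs) + 1)
    for subset in combinations(all_attrs, size)
    if is_reduct(table, list(subset), all_attrs)
]
```

In this toy table either of the redundant attributes b and c can be dropped, so there are two reducts.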

A decision rule is a formula of the form

$$(a_{i_1} = v_1) \wedge \ldots \wedge (a_{i_k} = v_k) \Rightarrow d = v_d \qquad (2)$$

where $1 \le i_1 < \ldots < i_k \le m$ and $v_j \in V_{a_{i_j}}$. Atomic subformulae
$(a_{i_1} = v_1)$ are called conditions. We say that rule $r$ is applicable to an
object or, alternatively, that the object matches the rule, if its attribute values
satisfy the premise of the rule. With a rule we can associate some characteristics.
Support, denoted $Supp_{\mathbb{A}}(r)$, is equal to the number of objects from
$\mathbb{A}$ for which rule $r$ applies correctly, i.e. the premise of the rule is
satisfied and the decision given by the rule is the same as the one preset in the
decision table. $Match_{\mathbb{A}}(r)$ is the number of objects in $\mathbb{A}$
for which rule $r$ applies in general. Analogously, the notion of a matching set for
a collection of rules may be introduced. By $Match_R(o)$ we denote the subset $M$
of the rule set $R$ such that the rules in $M$ are applicable to the object $o \in U$.
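
The notions of applicability, support and matching can be sketched as follows. The representation of a rule as a dict with "conditions" and "decision" keys, the function names and the toy table are assumptions made for this illustration only.

```python
def matches(rule, obj):
    """A rule is applicable to an object if all its conditions are satisfied."""
    return all(obj.get(attr) == value for attr, value in rule["conditions"].items())


def match_count(rule, table):
    """Number of objects to which the rule applies, whatever their decision."""
    return sum(1 for obj in table if matches(rule, obj))


def support(rule, table, decision_attr="d"):
    """Number of objects to which the rule applies with the correct decision."""
    return sum(
        1 for obj in table
        if matches(rule, obj) and obj[decision_attr] == rule["decision"]
    )


def matching_rules(rules, obj):
    """The subset of rules applicable to a given object."""
    return [rule for rule in rules if matches(rule, obj)]


# Toy decision table with two conditional attributes and decision d.
table = [
    {"a1": 1, "a2": 0, "d": 1},
    {"a1": 1, "a2": 1, "d": 1},
    {"a1": 1, "a2": 0, "d": 0},
    {"a1": 0, "a2": 0, "d": 0},
]
rule = {"conditions": {"a1": 1}, "decision": 1}
other = {"conditions": {"a1": 0, "a2": 0}, "decision": 0}
```

Here the rule (a1 = 1) => d = 1 matches three objects but is correct on only two of them, so its support is smaller than its match count.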

3  Rule  based  decision  systems 

Among other options, we may use decision (classification) support systems based
on rules derived from data. There are several approaches to generating such rules.
They differ in the way the rules are generated as well as in the form of rule
representation and use. Nevertheless, all the approaches have some common,
basic questions to answer. One of them, probably the most important one when
classifying new, unseen objects, is the question of the trustworthiness of a rule or
group of rules. Depending on the approach, there may be several issues to solve
while deciding what the decision for a newly arriving object should be.

Given a set of decision rules $R = (r_1, \ldots, r_m)$ derived from data by some
method and a new object $o_t$, we may face several problems while trying to
make a decision. Namely:

1. There may be no rule in $R$ that is applicable to $o_t$. In other words, the values
of the conditional attributes of $o_t$ do not satisfy the conditions of any rule in $R$.
In such a case we cannot make a decision, since there is no knowledge within our
rule set that covers the case of $o_t$.

2. There are several rules in $R$ that are applicable to $o_t$ but they give
contradictory outputs. This situation, known as a conflict between rules, must be
resolved by applying procedures that measure the confidence of particular rules
(or groups of them).

There are a number of possible solutions to the above two problems. Usually, to
resolve the problem of non-applicability of rules, one of three methods may be
applied:

-  The object is assigned the default value of the decision according to preset
assumptions.

-  The rule that has the best (according to a given criterion) applicability is chosen
and the decision is determined by this rule. The applicability criterion may
be based e.g. on the number of conditions in the rule that are satisfied by the
object. Another such criterion may be induced by preferences about the decision
value, as in the case of an ordered decision domain.

-  The "don't know" signal is returned to the user.

Of course, rule based decision systems are usually built in such a way as to
avoid the situation of not recognizing a new object. But still, the actual accuracy
depends on the quality of the derived rules.

The matter of resolving conflicts between rules may be even more complicated,
especially when we have many of them and no external, additional
information about their applicability and importance. To cope with that problem,
several techniques may be applied (refer to [7]). Presenting all of them here
is not possible, but we discuss some below.

The most popular way to establish the final decision is based on a comparison of
the number of rules from different decision classes that are applicable to a given
object. The object is assigned to the class determined by the majority of the rules
(in comparison with other classes). This method, however, treats all rules as
equally important. This may be a serious weakness, and in order to avoid it weights
may be assigned to rules (or groups of them). The method we exploit in our
experiments is based on the following formula describing the weight for a set of
rules:

Wsssi^iO)  = 

''  card(SuppAir)SCA{r) 

=  if  ^  cardiSupp^ir)  ■  SC^(r)  >  0  (3) 

t£M  rEM 

0  otherwise 

where $SC_A(r)$ is called the stability coefficient and is determined during the
process of rule calculation using dynamic reducts (see [3], [4] for a detailed
explanation). To give some intuition about $SC_A(r)$, it is worth knowing that it
mainly depends on the frequency of occurrence of rule $r$ in the set of optimal
rules at subsequent steps of the dynamic algorithm for rule generation (see [4]).
We use this method because numerous experiments (see [4], [3], [11]) indicate
that it is, on average, better than the others.
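
The difference between simple majority voting and weighted voting can be sketched as below. This is not a transcription of formula (3): each rule is weighted by the product of its support and its stability coefficient, as the terms of the formula suggest, but normalising by the total weight of all rules matching the object is an assumption of this sketch. The rule representation, the field names "support" and "sc", and the numbers are hypothetical.

```python
from collections import Counter


def matches(rule, obj):
    """A rule is applicable to an object if all its conditions are satisfied."""
    return all(obj.get(attr) == value for attr, value in rule["conditions"].items())


def majority_vote(rules, obj):
    """Class indicated by the largest number of applicable rules (None if none)."""
    votes = Counter(rule["decision"] for rule in rules if matches(rule, obj))
    return votes.most_common(1)[0][0] if votes else None


def weighted_vote(rules, obj):
    """Return (winning class, normalised class weights) using support * SC."""
    scores = {}
    for rule in rules:
        if matches(rule, obj):
            weight = rule["support"] * rule["sc"]
            scores[rule["decision"]] = scores.get(rule["decision"], 0.0) + weight
    total = sum(scores.values())
    if total <= 0:
        return None, {}  # no applicable rule carries any weight
    shares = {decision: score / total for decision, score in scores.items()}
    return max(shares, key=shares.get), shares


rules = [
    {"conditions": {"a": 1}, "decision": 1, "support": 10, "sc": 0.9},
    {"conditions": {"b": 0}, "decision": 0, "support": 4, "sc": 0.5},
    {"conditions": {"a": 1, "b": 0}, "decision": 0, "support": 3, "sc": 1.0},
]
obj = {"a": 1, "b": 0}
winner, shares = weighted_vote(rules, obj)
```

In this example two of the three applicable rules point at class 0, so majority voting chooses class 0, while the single well-supported and stable rule makes the weighted vote choose class 1.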

4  Rough set rule induction

The process of creating rules with the use of rough set techniques is essential for
our ideas of classifier construction. Therefore some basic information about methods
for rule induction is needed. The basis for deriving rules is reduct calculation.
Numerous practical experiments show that usually there is a need to calculate
several reducts in order to get a satisfactory quality of classification. Most of the
cases involving larger sets of data require calculation of reducts and rules with the
use of dynamic techniques. From a technical point of view the process of calculating
the reducts and rules is computationally expensive, and for real-world solutions
some approximate techniques such as heuristics or genetic algorithms are used
(see e.g. [12]).

The  derived  set  of  rules  R  may  be  for  some  reason  unsatisfactory.  The  major 
concerns  are: 

-  The  number  of  rules  is  excessive  so  the  cost  of  storing,  checking  against  and 
explaining  the  rules  is  not  acceptable. 

-  The  rules  are  too  general,  so  they  do  not  really  contain  any  valid  knowledge, 
or  too  specific,  so  they  describe  very  small  part  of  the  universe  in  too  much 
detail. 

To avoid at least part of the problems mentioned above we may apply shortening
procedures. Those procedures make it possible to shorten the rules and, in
consequence, reduce their number. The process of rule shortening consists of several
steps that lead to removing some descriptors from a particular
rule. Usually, after shortening, the number of rules decreases as repetitions occur
in the set of shortened rules. There are several methods leading to this goal; for
details see e.g. [1], [4], [13].

5  Rules  as  attributes 

In the classical approach, once we have decision rules we are at the end of
classifier construction. But there is also another way of treating the rules, since
they describe relations existing in our data. Therefore we may treat them as
features of objects. In this view, the process of rule extraction becomes one
of new feature extraction. These features are of higher "order" since they
take into account specific configurations of attribute values with respect to the
decision.

Let us consider a set of rules $R = (r_1, \ldots, r_m)$. We may construct a new
decision table based on them.

With every rule $r_i$ in $R$ we associate a new attribute $a_{r_i}$. The decision
attribute remains unchanged, as does the universe of objects. The values of the
attributes over objects may be defined in different ways depending on the nature
of the data. For the purposes of this research we use the following three possibilities:

-  $a_{r_i}(o_j) = d_k$, where $d_k$ is the value of the decision returned by rule
$r_i$, if the rule is applicable to object $o_j$; 0 (or any other constant) otherwise.

-  $a_{r_i}(o_j) = const$ (usually $const$ equals 1 or -1) if the rule $r_i$ applies to
the object $o_j$; 0 (or any other constant) otherwise.

-  In the case of tables with a binary decision, $a_{r_i}(o_j) = 1$ if the rule $r_i$
applies to the object $o_j$ and the output of this rule points at decision value 1;
$a_{r_i}(o_j) = -1$ if the rule $r_i$ applies to the object $o_j$ and the output of this
rule points at decision value 0. When the rule is not applicable, $a_{r_i}(o_j) = 0$.

Due to technical restrictions in further steps of classifier construction it is
sometimes necessary to modify the above methods, e.g. by encoding the decision
values in the first of the approaches in order to use a neural network, as in our
case.

It can easily be seen how important it is to keep the rule set within a reasonable
size. Otherwise the newly produced decision table may become practically
unmanageable due to the number of attributes.
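
The three ways of turning rules into attributes described in this section can be sketched as follows. The rule representation, the mode names ("decision", "constant", "signed") and the toy data are assumptions introduced for this illustration; the constant for the non-applicable case is fixed at 0.

```python
def matches(rule, obj):
    """A rule is applicable to an object if all its conditions are satisfied."""
    return all(obj.get(attr) == value for attr, value in rule["conditions"].items())


def encode(rule, obj, mode):
    """Value of the attribute generated by one rule for one object."""
    if not matches(rule, obj):
        return 0
    if mode == "decision":   # decision value returned by the rule
        return rule["decision"]
    if mode == "constant":   # fixed constant whenever the rule applies
        return 1
    if mode == "signed":     # binary decision: +1 for class 1, -1 for class 0
        return 1 if rule["decision"] == 1 else -1
    raise ValueError("unknown mode: " + mode)


def rule_table(rules, objects, mode):
    """New data table: one row per object, one column per rule."""
    return [[encode(rule, obj, mode) for rule in rules] for obj in objects]


rules = [
    {"conditions": {"a1": 1}, "decision": 1},
    {"conditions": {"a2": 0}, "decision": 0},
]
objects = [{"a1": 1, "a2": 0}, {"a1": 0, "a2": 1}]
signed = rule_table(rules, objects, "signed")
by_decision = rule_table(rules, objects, "decision")
```

Note that in the "decision" mode a rule pointing at decision value 0 produces the same value as a non-applicable rule, which is one reason why the decision values may need to be re-encoded before training a network.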

6  The  making  of  classifier 

Equipped with the decision table extracted with the use of the set of rules, we may
now proceed with the construction of the final classification (decision) system. In
order to keep the computation within reasonable bounds with respect to time and
space complexity, we apply very simple and straightforward methods. Namely, we
use a simple sigmoidal neural network with no hidden layers (see [6]). The overall
process of classifier construction is illustrated in Figure 1.


Fig.  1.  The  layout  of  new  classifier 


We start with the initial training decision table, for which we calculate reducts
and the set of possibly best rules. We may derive rules in a dynamic or
non-dynamic way depending on the particular situation (data). These rules are then
used to construct a new decision table in the manner described in the previous
section. Over the new data table constructed in this way we build a neural network
based classifier to classify the newly formed objects. Then the quality of the
classifier is checked on the testing set.

Of course, with the proposed scheme we may construct various classifiers, as
some parameters may be adjusted at any step of this process. In the process
of reduct and rule calculation we may establish restrictions on the number and size
of reducts (rules) as well as on rule specificity, generality, coverage and so on.
During neural network construction we may apply different learning algorithms.
The learning coefficients of those algorithms may vary as well.

To complete the picture of the classifier it is important to add a handful of
technical details. For the purpose of the research presented in this paper we used
dynamic calculation of rules based on a genetic algorithm and incorporating some
discretisation techniques for attributes that are continuous in nature. For details
consult [5]. On the neural network side we used a simple architecture with
neurons having the classical sigmoid or hyperbolic tangent as the activation function.
Usually, the network is equipped with a bias and trained using gradient descent
with regularisation, momentum and an adaptive learning rate (see [6], [2]).

The simple architecture of the neural network has one additional advantage.
From its weights we may decipher the importance of particular attributes (rules)
for decision making. This is usually not the case for more complicated neural
architectures, for which such an interpretation is difficult and the role of single
inputs is not transparent.
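
A network with no hidden layer and a sigmoid output can be sketched in a few lines. This is a simplified illustration, not the training setup used in the experiments: it uses plain batch gradient descent on the cross-entropy error, without the regularisation, momentum and adaptive learning rate mentioned above. The function names, the learning rate, the number of epochs and the toy data (two rule-generated inputs in the signed encoding) are all assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(X, y, epochs=200, lr=0.5):
    """Fit weights and bias of a single sigmoid unit by batch gradient descent."""
    n_features = len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = yi - sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        # Move along the averaged gradient of the log-likelihood.
        w = [wj + lr * g / len(X) for wj, g in zip(w, grad_w)]
        b += lr * grad_b / len(X)
    return w, b


def predict(w, b, x):
    return 1 if sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= 0.5 else 0


# Toy rule-generated inputs: the first rule agrees with the decision,
# the second one is unrelated to it.
X = [[1, 0], [1, 1], [-1, 0], [-1, 1], [1, -1], [-1, -1]]
y = [1, 1, 0, 0, 1, 0]
w, b = train(X, y)
predictions = [predict(w, b, x) for x in X]
```

After training, the weight attached to the first input is much larger in magnitude than the weight of the second, which illustrates how the weights of such a simple network can be read as the importance of the corresponding rules.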


7  Experimental  results 

The proposed methods have been tested against real data tables. For testing we
used two benchmark datasets taken from the repository [14] and one dataset received
from medical sources. Table 1 below describes the basic parameters of the decision
tables used in the experiments.

Dataset        Objects  Attributes  Attribute type  rank(d)
Monk1          432      6           binary          2
Monk2          432      6           binary          2
Monk3          432      6           binary          2
Lymphography   148      18          symbolic        4
EEG            550      105         binary          2

Table 1. Datasets used for experiments.

The EEG data was originally represented as a matrix of signals that was further
converted to binary form by applying wavelet analysis and discretisation
techniques, as originally proposed in [11], [5] and developed in [10]. The MONK
datasets have a preset partition into training and testing sets; the rest of the data
sets were tested using the cross-validation method.

The rules were calculated using dynamic techniques. Then we performed several
experiments using different rule shortening ratios. Table 2 below shows the
best results.

Data sets      Number    Shortening  Method  Error      Error
               of rules  ratio               Proposed   Other
Monk1          31        0.6         TT      0/0.03     0/0
Monk2          26        0.6         TT      0/0.06     0/0.049
Monk3          44        0.6         TT      0/0.051    0/0.046
Lymphography   78        0.8         CV-10   0.03/0.19  0/0.15
EEG            13        0.3         CV-5    0/0.01     0.11/0.16

Table 2. The results of experiments.

The columns in this table describe the number of rules used for the new table
(Number of rules), the shortening ratio of rules (between 0 and 1), the method of
training/testing (TT = train & test, CV-n = n-fold cross-validation), the average
error on the training/testing set as a fraction of the number of cases, and the best
results obtained from other rough set methods for comparison. The experiments
were performed several times in order to get averaged (representative) results.
The comparison is made with the best result obtained from the application of
combined rough set methods. However, it is important to mention that those best
classifiers are usually based on much larger sets of rules.

The results are comparable to those published in [8] and [3], but they usually
use far fewer rules and a simpler classifier setting than the best results
in [4] and [3]. The most significant boost is visible if we compare the outcome
with that of classification using only the calculated rules with the classical weight
setting. Especially in the case of a small shortening ratio, which corresponds to a
significant reduction of rules, the impact of the proposed methods is clearly visible.


8  Conclusions 

The proposed approach makes it possible to construct a classifier that combines
rule based systems and neural networks. The rough set rules derived with respect
to the discernibility of objects seem to possess extended importance if used as
new feature generators. The application of a neural network in the last stage of
classifier construction allows better fitting to a particular set of data and, due to
its adaptiveness, makes further addition of new knowledge to the system easier.

Initial experiments show promising results, especially in cases of a binary
decision. The reduction of the number of rules used makes the system obtained in
this way closer to natural intuitions.

As the work on this issue is at its beginning, there is still a lot to do in many
directions. Most interesting from our point of view is further investigation of the
relationship between the process of rule induction with rough sets and the
quality of the resulting rules as new attributes.

Abstract. In this paper¹ we show a new learning algorithm for pattern
classification. We propose a scheme that addresses the problem of
incremental learning algorithms whose structure becomes too complex
because of noise patterns included in the learning data set. Our approach
to this problem uses a pruning method which terminates the learning process
according to a predefined criterion. Then an iterative model with a 3 layer
feedforward structure is derived from the incremental model by appropriate
manipulation. Note that this network is not fully connected between the
upper and lower layers. To verify the effectiveness of the pruning method,
the network is retrained by EBP. We test this algorithm by comparing
the number of nodes in the network with the system performance, and
the system is shown to be effective.


1  Introduction 

Conventional iterative models such as EBP usually have a fixed feedforward
network structure and use an algorithm that gradually modifies the weights of the
network as learning proceeds. This approach does not allow the network to be
expanded during training. It therefore has a critical limitation: it depends on
trial and error or ad hoc schemes to obtain an appropriate architecture
for the patterns to be learned.

Therefore another approach was devised to solve this problem by adding nodes
to the network when necessary. This type of learning is referred to as incremental
learning, as the network grows while training proceeds. As one such procedure, Lee
et al. [2] have proposed an incremental algorithm using Fisher's Linear
Discriminant Function (FLDF) [1]. This model searches for an optimal projection
plane based on a statistical method for pattern classification. Then, after projecting
the patterns onto this projection plane, the model starts a search procedure for an
optimal hyperplane based on an entropy measure and thus determines the
neuron in the structure.

¹ This research is supported by the Brain Science and Engineering Research
Program in Korea.

Lee et al. [3] introduced a neural network learning algorithm which transforms the
structure of an incremental model into that of an iterative model. This model
showed that the weights and thresholds as well as the structure of the 3 layer
feedforward neural network can be found systematically by examining the
instances statistically. It is well known that a major part of the learning capability
of a model lies in its architecture.

In iterative models the approaches to solving this problem are as follows. Kung
et al. [5] proposed a method which learns with a network structure that has a
predefined number of nodes, but in practice this method converged less well than
its theoretical basis suggests. Sietsma and Dow [7] devised an algorithm which
assigns many nodes before learning and then removes nodes by observing which
nodes are inactive during learning. Though this algorithm can be applied to
simple problems, it encounters many difficulties when a problem is more complex.
Hanson and Pratt [9] proposed an algorithm which removes hidden nodes
by means of a constraint term in the EBP function, but this algorithm has the side
effect of reducing the probability of convergence. Hagiwara [8] proposed an
algorithm which considers a proper number of nodes and the weight values
concurrently, but this algorithm needs much time to converge. Moody and
Rognvaldsson [10] proposed adding a complexity-penalty term, but it has a much
more complicated form and also demands much more computation. Wangchao
et al. [11] introduced the sparselized pruning algorithm for a higher-order neural
network, but it is applied only after all the higher-order weights are trained.

The incremental model uses an algorithm that tries to produce, intelligently, a
neural network with a near-optimal architecture. But this model has a drawback
in that the network can grow excessively because of noise contained in the
patterns. In this paper we propose a method to solve this problem.

2  Background 

2.1  The  Incremental  Network  Model 

In this paper we represent a pattern by a vector of n components and describe
a pattern classifier as a mapping of the input pattern space, a subset of
n-dimensional real space, to the set of classes 1, 2, ..., k. In order to make the
output decisions we develop constructs which build internal representations of a
class description. To do this, we represent one unit as a hyperplane specified
by its elements (weight vector, threshold value) and then partition the space as
follows.

$$\mathrm{Hyperplane}(P) = \{X \mid X \in R^n \text{ and } W^T X = T\} \qquad (1)$$

where X : input pattern, $R^n$ : real space, W : weight, T : threshold.

The hyperplane P separates the space into two sets, $P^L$ and $P^R$. Thus input
X belongs to $P^L$ or $P^R$ by P.

$$P^R = \{X \mid X \in R^n \text{ and } W^T X > T\}$$

$$P^L = \{X \mid X \in R^n \text{ and } W^T X < T\}$$
The network structure of the model consists of one input unit, a number of
hidden units, and as many output units as there are classes. Each unit has
a weight vector and a threshold value. The input X is broadcast to all units,
and for each unit at most one path is activated. Thus, in the whole network at
most one path is activated for each input vector.


2.2  The  Training  Process 

During the training phase, a collection of classified patterns describing a
desired class is presented to an incrementally formed network of neurons, and the
weight vector and threshold values of the units of the network are determined by
an adaptation process. Each unit is assigned to represent a certain hyperplane
which is part of the discriminant hypersurface represented by the network. The
training set is presented a number of times and at each presentation the network
is expanded by adding a number of new units. The adaptation process is carried
out at each unit independently of the others.

In the adaptation process, Fisher's linear discriminant function [2] is used
to determine the optimal hyperplane. Fisher's linear discriminant function
provides the optimal weight for the input pattern data for an arbitrary
distribution. Fisher's formula for n classes is shown in (2). The optimality is
characterized by an overall measure representing the mutual distances between
the set of projected points of one class and that of another class, and it is
achieved by maximizing the overall measure, B, standardized by V. We use the
Cauchy-Schwarz inequality to obtain the maximum value.

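$$\frac{W^T B W}{W^T V W} = \frac{W^T \left[\sum_i (\bar{X}_i - \bar{X})(\bar{X}_i - \bar{X})^T\right] W}{W^T \left[\sum_i \sum_j (X_{ij} - \bar{X}_i)(X_{ij} - \bar{X}_i)^T\right] W} \qquad (2)$$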

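Let $\beta = V^{1/2} W$; then (2) becomes $\beta^T V^{-1/2} B V^{-1/2} \beta / \beta^T \beta$. This formula attains its
highest value when the vector $\beta$ becomes the eigenvector $e_1$ which is associated with
the highest eigenvalue $\lambda$ of the matrix $V^{-1/2} B V^{-1/2}$. Thus the weight vector (W) is
obtained as $W = V^{-1/2} e_1$.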

The  threshold  value  determining  the  position  of  the  hyperplane  is  obtained  based 
on  the  following  entropy  function. 

$$H(C \mid d) = P_L\, H(p_1) + P_R\, H(p_2), \qquad d : \text{cursor position} \qquad (3)$$

After projection ($W^T X$), the projected points are divided into two parts by a
dividing plane placed at a cursor position. After the entropy is measured, the plane
is moved one projected point at a time, starting from $d_1$ ($W^T X_1$) and ending at the
last projected point. The optimal position is where the smallest value of entropy
is found. Let $n_1$ ($n_2$) be the number of patterns in the left (right) region; then
$P_L = n_1/(n_1 + n_2)$ and $P_R = n_2/(n_1 + n_2)$. Let $x_{ij}$ be
the number of class j events in region i (left, right). The probability $p_{ij}$
that the class is j in region i is $x_{ij}/n_i$. Then $H(p_i) = -\sum_j p_{ij} \log_2 p_{ij}$, for i = 1, 2 and
j = 1, ..., number of classes.
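The cursor search described above can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the function names `class_entropy` and `best_cursor` are hypothetical, and the cursor is placed midway between neighbouring projected points, which is a choice made here for the example (the text only says the plane is moved one point at a time).

```python
import math
from collections import Counter


def class_entropy(labels):
    """Shannon entropy (base 2) of the class labels falling in one region."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def best_cursor(projections, labels):
    """Return (cursor, entropy) minimizing P_L*H(left) + P_R*H(right).

    projections: the values W^T X of the training patterns.
    labels: the class of each pattern, in the same order.
    """
    pairs = sorted(zip(projections, labels))
    total = len(pairs)
    best = (None, float("inf"))
    for k in range(1, total):
        if pairs[k - 1][0] == pairs[k][0]:
            continue  # no cursor position separates identical projections
        left = [lab for _, lab in pairs[:k]]
        right = [lab for _, lab in pairs[k:]]
        h = (len(left) / total) * class_entropy(left) \
            + (len(right) / total) * class_entropy(right)
        if h < best[1]:
            # Assumption: put the cursor midway between the two neighbours.
            best = ((pairs[k - 1][0] + pairs[k][0]) / 2.0, h)
    return best
```

For two cleanly separated classes the weighted entropy drops to zero at the gap between them, so the returned cursor would serve as the threshold T of the unit.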
2.3 Translation into an iterative model structure

In this section we present a transforming procedure which converts the
incremental network topology into an iterative one.

Step  1  :  The  input  layer  consists  of  as  many  nodes  as  the  number  of  variables 
(dimensions)  in  learning  patterns. 

Step  2  :  The  first  layer  has  the  same  number  of  nodes  as  that  of  the  hidden 
nodes  except  for  leaf  nodes  in  the  incremental  model.  The  weight  and  the 
threshold  between  the  input  layer  and  the  first  layer  are  fully  connected  and 
have  the  same  values  as  those  of  hidden  nodes  in  the  incremental  model. 

Step 3 : The second layer has as many nodes as the number of leaf nodes
representing the region of each class in the incremental model. Each region
is made by the intersection of hyperplanes in the first layer, thus this layer
is characterized by AND : when all the inputs are active, the output is
active. The weight and threshold between the first layer (j) and the second layer (i)
are determined by each path from the input node to the leaf node in the
incremental model, that is,

$$\sigma_{iL} = -1, \qquad \sigma_{iR} = +1$$

$$W_{ij}\ (\text{weight}) = \sigma_{iD}, \qquad T_i = \sum |\sigma_{iD}| - 0.5$$

Here, i = 0, ..., (the number of discriminated regions - 1), and $\sigma_{iD}$ (D = L (left space),
D = R (right space)) denotes the ith path.

Step 4 : In the third layer there are as many nodes as the number of classes.
The intersected regions from the second layer are unioned in each class
region. The weight and threshold between the second (j) and the third (i) layer are
determined by the class of each region.

If  the  region  from  ith  path  contains  class  j  :  Rij  =  1 
Otherwise  :  Rij  =  0 

Wij  (Weight)  =  Rij  . 

Ti  (Threshold)  =URij  —  0.5 


For further information, please refer to [3].

3  Pruning  in  the  incremental  learning  Model 

3.1  Pruning  algorithm  for  the  proposed  incremental  model 

Most network structures produced by incremental learning algorithms have the
shape of a binary tree, and hence the number of nodes does not need
to be predetermined: it is fixed as the structure is built. But an incremental model
has to solve a new problem, as too many nodes may be added. This
is because a learning pattern set can contain much noisy data. The algorithm
above also has this problem, as it iterates until only one class of data is
contained in each divided space. As the algorithm proceeds recursively, both the
computational time and the memory consumption increase. Moreover, the
network performance can be degraded by the effect of the noise. Therefore, an
algorithm must be devised to overcome these problems. The following rule
is added to the algorithm introduced in Section 2.2 to solve them:

IF ($N_i(P)$ > PruneRate) Continue the learning procedure.

ELSE Terminate the learning procedure.

where PruneRate is a criterion for the percentage of noise, and $N_i(P)$ for
the learning pattern set P at node i is determined as follows:

$$N_i(P) = \frac{\#TE - \#MCE}{\#MCE} \times 100.0 \qquad (4)$$

- #TE : Total number of patterns at node i.
- #MCE : The number of patterns of the most frequent class among
the #TE patterns.
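The stopping rule in (4) can be sketched as below. The names `noise_measure` and `should_split` are hypothetical, and class labels are assumed to be hashable values; this is an illustration of the formula as printed, not the authors' code.

```python
from collections import Counter


def noise_measure(labels):
    """N_i(P) = (#TE - #MCE) / #MCE * 100 for the labels reaching node i."""
    total = len(labels)  # #TE
    if total == 0:
        raise ValueError("node received no patterns")
    majority = Counter(labels).most_common(1)[0][1]  # #MCE
    return (total - majority) / majority * 100.0


def should_split(labels, prune_rate):
    """Continue learning at this node only if N_i(P) exceeds PruneRate."""
    return noise_measure(labels) > prune_rate
```

With a prune rate of 0 a node keeps splitting until it holds a single class, which matches the behaviour of the unpruned algorithm; larger values stop earlier and treat the minority patterns as noise.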


3.2  Selective-learning  algorithm 

In Section 2.3 we introduced a method which transforms the structure of
an incremental model into that of a 3 layer iterative model. The transformed
network is not fully connected but partially connected, except
between the input layer and the hidden layer. In this section we propose a
method to reduce the network structure using an iterative learning algorithm.
It is based on the observation that, in the algorithm explained in
Section 3.1, the pruning procedure corresponds to a reduction in the number of
nodes, and the partial connection between layers corresponds to a reduction in
the size of the weight set. The network structure of this model consists of nodes
with the threshold function and nodes with the sigmoid function. This structure
is shown in Fig. 1. We use the EBP algorithm to train the structure. As an initial
structure, the first layer is constructed from nodes with the same weights and
threshold values as determined in incremental learning, and bias nodes are
added to the second hidden layer and the output layer.

The learning procedure of this model is described below. The first hidden
layer (O) is activated as follows:

IF ($\sum_j W_{ij} X_j > T_i$) $O_i = 1$
ELSE $O_i = 0$

EBP learning is performed on the upper layer using the output from the
first layer. As our model is partially connected, the EBP update is done
according to the following:

$$\Delta_p W_{ji}(n+1) = C_{ji}\left(\eta\, \delta_{pj}\, O_{pi} + \alpha\, \Delta_p W_{ji}(n)\right) \qquad (5)$$

where $C_{ji}$ is 1 if there is a connection between node j in the upper layer and
node i in the lower layer, and 0 otherwise.
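A minimal sketch of the masked update in (5), assuming plain nested lists for the weight changes and the connection mask. The function name `masked_momentum_step` and the argument names are hypothetical; the default rates are the values reported in the Implementation section.

```python
def masked_momentum_step(delta_w_prev, conn, delta, out_lower, eta=0.2, alpha=0.7):
    """Weight changes for one pattern under a connection mask.

    delta_w_prev[j][i]: previous change of the weight from lower node i to upper node j.
    conn[j][i]: 1 if that connection exists, 0 otherwise (C_ji).
    delta[j]: back-propagated error term of upper node j.
    out_lower[i]: output of lower node i.
    """
    return [
        [
            conn[j][i] * (eta * delta[j] * out_lower[i] + alpha * delta_w_prev[j][i])
            for i in range(len(out_lower))
        ]
        for j in range(len(delta))
    ]
```

Because the mask multiplies the whole bracket, a missing connection receives no gradient step and accumulates no momentum, so it stays absent throughout training.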
Fig. 1. The network structure of the selective learning model


4  Implementation 

To test our system, we use sleep stage scoring data sets and the speech data
used by Peterson and Barney. We implemented the following three methods and
measured their performance.

1. Fisher : Weight vector and threshold values are determined using the
algorithm in section 2.2. Prune Rate (PR, 0 < PR < 100) is varied and the
performance is measured.

2.  Prun  :  After  transforming  into  the  network  topology  explained  in  section 
2.3,  we  train  the  fully  connected  network  by  EBP  with  the  weights  and  the 
thresholds  initialized  from  the  procedure  1. 

3.  Sel  :  After  transforming  into  the  network  topology  explained  in  section  2.3, 
we  train  the  partially  connected  network  by  EBP  with  weights  and  thresholds 
initialized  from  the  procedure  1.  The  performance  is  measured  as  the  number 
of  connections  varies. 

PR is increased by 1% from 0 to 30%, and by 5% after 30%. The learning
rate η and the momentum rate α are set to 0.2 and 0.7, respectively. From the
experiment, we observe that the performance of "Sel" up to PR = 21%
and of "Prun" up to PR = 40% is similar to or even better than that of
"Fisher". The result is shown in Fig. 2. Fig. 3 shows that the number of
nodes decreases as PR increases. Fig. 4 shows that the number of connections
decreases as PR increases. As "Prun" has a fully connected network structure
while "Sel" has a partially connected one, "Prun" has more connections than
"Sel".
Fig. 2. The performance comparison of 3 learning methods


Fig.  3.  The  number  of  nodes  vs.  PR. 


Fig.  4.  The  number  of  connections  vs.  PR. 


5  Conclusion 

In this paper we proposed a solution to the problem that the network
structure of an incremental model can be extended excessively when the learning
pattern set contains many noisy patterns. The proposed method uses a predefined
parameter, PR, to stop the recursive process that builds the network structure.
After this binary tree network structure is transformed into the three layer
feedforward structure, EBP is employed to train the structure further. An
appropriate number of nodes and the corresponding weights between the nodes are
thereby determined, which is the aim of the pruning process.
Abstract. Intuitively, patterns of numerical sequences are often
interpreted as formulas. However, we observed earlier that such an intuition
is too naive. Notions analogous to Kolmogorov complexity theory are
introduced. Based on these new formulations, a formula is a pattern only
if its pattern complexity is simpler than the complexity of data.


1  Introduction 


Mathematicians  routinely  write  down  the  general  term  of  a  given  sequence  by 
inspecting  its  initial  terms.  Such  actions  involve  pattern  discovery  and  sequence 
prediction.  The  automation  of  the  latter  has  been  an  important  area  in  machine 
learning  [5]. 

Intuitively, the pattern of a finite numerical sequence is often interpreted as
a formula that generates the finite sequence. Earlier, we observed [3], somewhat
surprisingly, that such a simple-minded formulation leads to a no-prediction
phenomenon; see Section 3. Briefly, given any real number, there is a "pattern" that
predicts it. This phenomenon prompts us to seek a more elaborate notion of patterns.

One possible fundamental approach is Kolmogorov complexity theory [4], in
which patterns are interpreted as algorithms. This approach is theoretical; there
is no practical way to determine the complexity of an explicitly given finite
sequence. Aiming toward practical applications, we propose various notions of
pattern complexity analogous to Kolmogorov's; see Section 5. Following
Kolmogorov, we define the complexities of patterns and data, but based on
function-theoretic views instead of algorithmic views. As a conclusion, a formula
is a pattern only if its complexity is simpler than that of the numerical data.

The proposed theories are probably overly simplified notions, but they are
practically manageable approximations to that of Kolmogorov. Finally, we
point out that mathematicians use not only the numerical values but also
their "physical" expressions to predict the sequence. In this paper, however,
we focus only on the numerical values.
2 Patterns of Numerical Data

Given a finite numerical sequence,

Seq : $a_1, a_2, \ldots, a_n$

What would be the most natural meaning of its pattern? Intuitively, a pattern is
a formula G(x) that generates the finite sequence. Mathematically, a formula is
not a precisely defined term; roughly, it can be interpreted as a function that can
be expressed by well-known functions, such as polynomials, trigonometric, radial
basis or other special functions. Since the formula G(x) is often valid beyond n,
such a formula is also referred to as a generalization.

Next let us rephrase the problem in geometry; the finite sequence can be
viewed as a set of points in the Euclidean plane,

$(1, a_1), (2, a_2), \ldots, (n, a_n)$

and the problem is to find a function

$G : i \to a_i, \quad i = 1, 2, \ldots, n$

whose graph is a "nice" curve passing through these given points. It seems clear
that there would be many "nice" curves. Occam's razors are needed; so pattern
complexity theories are developed.

3  Intuitive  Solution  -  No  Prediction  Phenomenon 

Let  us  recall  the  following  arbitrary  prediction  phenomenon  [3] :  Suppose  we  are 
given  a  sequence 


1, 3, 5, 7

What  would  be  the  next  number?  Commonly,  one  would  say,  according  to  the 
pattern  of  the  initial  four  terms, 

the  next  number  would  be  9. 

However, we have a somewhat surprising observation. We obtained the following
examples using Matlab:

1. $f_6(x)$ predicts: 1, 3, 5, 7, 6

$f_6(x) = -0.1250x^4 + 1.2500x^3 - 4.3750x^2 + 8.2500x - 4.0000$

2. $f_8(x)$ predicts: 1, 3, 5, 7, 8

$f_8(x) = -0.0417x^4 + 0.4167x^3 - 1.4583x^2 + 4.0833x - 2.0000$
Let us recall the following elementary algebra: If m values are assigned to m
points in a Euclidean space (of dimension n), then there is a polynomial of n
variables which assumes the given m values at the m given points.

NO PREDICTION THEOREM
Given a finite numerical sequence

$a_1, a_2, a_3, \ldots, a_n$

and any real number r, there is a pattern that would predict the following sequence

$a_1, a_2, a_3, \ldots, a_n, r$
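The theorem can be checked on the example of this section with Lagrange interpolation. The sketch below is an assumption-laden illustration: `lagrange_eval` and `pattern_predicting` are hypothetical names, and exact rational arithmetic (`fractions.Fraction`) stands in for real numbers, so r must be an integer or a rational here.

```python
from fractions import Fraction


def lagrange_eval(points, x):
    """Evaluate at x the unique polynomial of degree < len(points) through the points."""
    x = Fraction(x)
    total = Fraction(0)
    for i, (xi, yi) in enumerate(points):
        term = Fraction(yi)
        for j, (xj, _) in enumerate(points):
            if j != i:
                term *= (x - xj) / Fraction(xi - xj)
        total += term
    return total


def pattern_predicting(seq, r):
    """Return a polynomial function that reproduces seq at 1..n and gives r at n+1."""
    points = [(i + 1, a) for i, a in enumerate(seq)]
    points.append((len(seq) + 1, r))
    return lambda x: lagrange_eval(points, x)
```

For instance, `pattern_predicting([1, 3, 5, 7], 6)` returns a degree 4 polynomial that takes the values 1, 3, 5, 7, 6 at x = 1, ..., 5; by uniqueness of the interpolating polynomial it coincides with the Matlab result $f_6(x)$, whose constant term is -4.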

It is clear that a mere formula cannot be the right notion for patterns. In the next
few sections, we develop various complexity theories, analogous to Kolmogorov's,
to formulate the notion of patterns.


4  An  Attempt  from  Algorithmic  Information  Theory 

A finite numerical sequence is, of course, expressible as a bit stream. So one
might be able to apply Kolmogorov complexity theory here. Let us recall a few
notions. Let K denote the Kolmogorov complexity, and let length(p) be the length
of the program p using a consistent method of counting, such as binary length.

K(a) = Min{length(p) | p is any conceivable program that generates the
string a}

Let length(a) be the length of a string a. Then a is said to be Kolmogorov
random if K(a) ≥ length(a). A finite sequence is said to have a pattern if
K(a) < length(a). Intuitively, K(a) measures the complexity of a pattern and
length(a) the complexity of data. Next let us quote a few interesting propositions:

1. Almost all finite sequences are random (have no patterns).

2. Gödel-type incompleteness theorem: It is impossible to effectively prove that
they are random.

Due to the last assertion, algorithmic information theory cannot be useful here.
More practical approaches that could approximate the Kolmogorov complexity
are needed.
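As a side illustration of what a computable stand-in for K(a) can look like, the length of a compressed encoding gives an upper-bound style proxy. This is not the approach taken in this paper (which uses polynomial degree and basis weights in Section 5); the function names below are hypothetical and the choice of zlib is an assumption made only for the example.

```python
import zlib


def compressed_length(data: bytes) -> int:
    """Length in bytes of a zlib encoding of data: a crude, computable proxy for K."""
    return len(zlib.compress(data, 9))


def looks_patterned(data: bytes) -> bool:
    """Proxy for K(a) < length(a): the data compresses below its own length."""
    return compressed_length(data) < len(data)
```

A long repetitive string such as `b"13579" * 200` compresses far below its length, while very short or incompressible inputs do not; the proxy can only overestimate the true complexity, never certify randomness.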


5  Complexity  Theories  of  Patterns 

Instead of capturing the algorithm that defines G(x), we are looking for a method
to describe G(x) in terms of a class of known functions.
5.1 Algebraic Information Theory

In this section, we will use the class of polynomial functions as our basis.
Although it is not known a priori that G(x), x = 1, 2, ..., n, is a polynomial, the
Weierstrass approximation theorem states that G(x) can be approximated by a
polynomial. Since the domain {x | x = 1, 2, ..., n} is a finite set of points, G(x)
actually is a polynomial on it.

It is our religious belief that a shortest algorithm defines a simplest
polynomial, and that the degree is the best measure of its simplicity. So we believe
G is the polynomial of least degree among those that generate the finite sequence
$a = \{a_x \mid x = 1, 2, \ldots, n\}$. Note that length(a) = (n - 1). To mimic Kolmogorov,
let D denote the algebraic complexity and define

D(a) = Min{degree(p) | p is any conceivable polynomial that generates
the finite sequence a}

From the well ordering principle ([1], p. 11), there is a polynomial H whose
degree is D(a). This polynomial H is the desired G on the domain {x | x =
1, 2, ..., n}. However, the natural domain of H is the real numbers; it is well
beyond the original domain of G.

We need a few notions: Let length(a) be the length of a finite sequence a. Then
a is said to be algebraic random if D(a) ≥ length(a). A finite sequence is said to
have a pattern if D(a) < length(a). Intuitively, D(a) measures the complexity of
an algebraic pattern and length(a) the complexity of data. So H is the algebraic
pattern if a is not algebraic random.

Let us apply the theory to answer the no-prediction phenomenon. Note that
$\deg(f_8(x)) = 4$ and length(Seq) = 4, so $\deg(f_8(x)) \ge$ length(Seq). By our
theory, the polynomials found in Section 3 are not patterns. We should point
out that $f_6(x)$ is excluded by its degree (algebraic complexity). Certainly, it
is conceivable that $f_6(x)$ may not be excluded from the algorithmic view, but
our religious belief will not admit that.
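For sequences indexed by x = 1, 2, ..., n, the algebraic complexity D(a) can be computed with repeated forward differences, since the k-th differences of a degree k polynomial are constant. The sketch below assumes exact arithmetic (integers or rationals; floating-point input would need a tolerance), assigns degree 0 to a constant sequence, and uses length(a) = n - 1 as stated above. The function names are hypothetical.

```python
def algebraic_complexity(seq):
    """D(a): least degree of a polynomial p with p(x) = a_x for x = 1..n."""
    diffs = list(seq)
    degree = 0
    # Take forward differences until the row is constant (or has one entry).
    while len(diffs) > 1 and any(d != diffs[0] for d in diffs):
        diffs = [b - a for a, b in zip(diffs, diffs[1:])]
        degree += 1
    return degree


def has_algebraic_pattern(seq):
    """True if D(a) < length(a), with length(a) = n - 1 as in the text."""
    return algebraic_complexity(seq) < len(seq) - 1
```

On the running example, 1, 3, 5, 7, 9 has D = 1 and therefore a pattern, whereas 1, 3, 5, 7, 6 and 1, 3, 5, 7, 8 both need degree 4 and are algebraic random.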

Finally, we would like to point out that both algorithmic and algebraic
patterns do meet the requirements of the first and second razors of Pedro Domingos
[2], except that the simplicity for the first razor is measured by two different metrics.
Even though our religious belief states that they are the same, in reality they may
be different, and the results may be addressing some deep issues; they will be
investigated in the near future.

5.2  Functional  Information  Theory 

Instead  of  polynomial  functions,  we  can  also  consider  any  Schauder  basis  (as 
Banach  space  [6])  of  the  function  space  (under  consideration),  such  as,  trigono¬ 
metric  functions  in  L^-space,  radial  basis  functions  in  -space,  and  many  others. 

In these categories of functions, it is less clear what the simplest one would be.
For trigonometric functions, such as sin(nx) or cos(mx), we believe those with the
least positive n or m are the simplest. Roughly, the weights (as used in neural
networks) are the measures of their simplicity. The exact meaning of weights is, of
course, Schauder basis specific; we shall not be specific here.
To mimic Kolmogorov, we need some notation: Let B be a selected Schauder
basis. A linear combination of functions in B will be called a B-combination.
As before, G denotes the function that generates the sequence x. Now, let F
denote the functional complexity and define

F(x) = Min{weight(p) | p is any conceivable B-combination that
approximates G}

As in the previous section, the domain is a finite set of points, so G is actually a
B-combination. To illustrate the idea, we use the terminology of this section
to explain the results of the previous section. Let B be the set of all monomials. B is
a Schauder basis. Since the domain is a finite set of points, G is exactly a linear
combination of monomials, in other words, a polynomial. We thus recover the result
of the previous section using the reasoning of this section.

As  before,  a  finite  sequence  is  said  to  have  a  pattern  if  F(x)  <  length(x). 


6  Conclusion 

This paper examines various notions of patterns in finite numerical sequences. By
adopting the approach of classical algorithmic information theory (Kolmogorov
complexity theory), two approximations, called algebraic and functional
information theories, are proposed. Based on these theories, we conclude that a formula
is a pattern only if its pattern complexity (with respect to its proper theory) is
simpler than that of data. We believe the theories should be useful in scientific
discovery or financial data mining.
Abstract. Performance prediction for classification systems is
important. We present new techniques for such predictions in settings where
data items are to be classified into two categories. Our results can be
integrated into existing classification systems and provide an accurate and
predictable tool for data mining. In any given classification case, our
approach uses all available training data for building the classification
scheme and guarantees zero classification errors on the training data. We
re-use the same training data to predict the performance of that scheme.
The method proposed here enables control over the two types of
error in the classification task.

Keywords: Classification, Data Mining, Decision Support, Error Prediction.


1  Introduction 

Performance  prediction  is  useful  for  evaluating  the  performance  of  a  classification 
system  and  for  comparing  or  combining  such  systems.  Thus,  it  is  an  important 
part  of  data  mining  [2] .  Traditional  performance  prediction  methods  [5]  withhold 
a  portion  of  the  given  data  during  training  and  estimate  the  errors  after  training 
from  that  portion.  Typically,  the  same  process  is  done  iteratively,  and  the  average 
of  all  error  estimations  is  the  final  estimate. 

We have developed a new approach for estimating the performance of
classification systems. We carry out training using all available data and estimate errors
using the same data. In two-class classification, the predicted error distribution
can be used to control classification errors. In the next section, we describe how
to choose a reliable classification method and create a classification family as
the classifier in our system from the provided training data. Section 3 describes
how to use the same training data to estimate the performance of the
classification family and to derive a decision scheme for the classifier based on the
performance estimation. We give experimental results and conclusions in Sections
4 and 5, respectively.

*  This  work  was  done  when  the  author  studied  in  the  Computer  Science  Department 
of  the  University  of  Texas  at  Dallas.  The  author  is  currently  working  for  Alcatel 
Network  Systems. 
2 Construction of a Classification Family

One can employ any existing classification method, such as a decision tree, a
Bayesian classifier, or a neural network, to construct the classification family in
this module, provided that the method generates a vote count for a record to
indicate its classification preference. In our system, we use a method developed in
our lab to generate the classification family C [3]. We emphasize the way the
classification family is generated, because this particular construction is what
enables a good estimate of the system performance.

There are two disjoint populations $\mathcal{A}$ and $\mathcal{B}$ of records. We are given subsets
$A \subset \mathcal{A}$ and $B \subset \mathcal{B}$ as training data. Given the training sets A and B, we first
select an integer d > 5 and partition A into d nonempty subsets $A^1, \ldots, A^d$
of essentially equal cardinality and view $A^1, \ldots, A^d$ as a circular list. We
let c be the smallest integer that is larger than d/2. We take
the union of $A^i$ and of the (c - 1) subsequent subsets in the circular list and call
that union $A_i$, that is, $A_i = A^i \cup A^{i+1} \cup \cdots \cup A^{i+c-1}$, with superscripts
wrapping around the circular list. We obtain $A_1, A_2, \ldots, A_d$ accordingly. Applying
the analogous process to B, we obtain the corresponding sets $B_i$. For each pair $A_i$ and
$B_i$, we use the chosen classification method to generate a classification family
member $C_i$, which gives e votes depending on different criteria [3]. Overall, the
classification family C gives d · e votes to any record. If we count a vote
for $\mathcal{A}$ as +1 and a vote for $\mathcal{B}$ as -1, the sum of all votes is referred to as the vote total.
Due to the cancellation effect, the vote total lies in the range
between -d · e and +d · e. We use z to denote the vote total. In our experiments,
we use d = 10 and e = 4. Thus, z is within the range between -40 and +40. We
use all the training data to generate the classification family C, and this C, which
is the classifier of the classification system, is used to classify new data. The
vote total that C produces for a record of $\mathcal{A}$ (resp. $\mathcal{B}$) can be viewed as a random variable
$Z_A$ (resp. $Z_B$). The following section describes how to estimate the probability
distributions of $Z_A$ and $Z_B$.
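The circular-list construction can be sketched as follows. This is an illustration under assumptions: `circular_unions` is a hypothetical name, the partition into d blocks is done round-robin (the text only requires essentially equal cardinality), and at least d records are assumed so that no block is empty.

```python
def circular_unions(records, d, c=None):
    """Split records into d blocks and build the d circular unions.

    Returns (blocks, unions) where unions[i] is block i followed by the
    next c - 1 blocks in circular order.
    """
    if c is None:
        c = d // 2 + 1  # smallest integer larger than d/2
    # Round-robin split: block sizes differ by at most one.
    blocks = [records[i::d] for i in range(d)]
    unions = [
        [r for step in range(c) for r in blocks[(i + step) % d]]
        for i in range(d)
    ]
    return blocks, unions
```

With d = 10 the default gives c = 6, so each family member is trained on six of the ten blocks and leaves the remaining four blocks unseen; every record is therefore unseen by exactly four members, which is what Section 3 exploits.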


3  Performance  of  the  Classification  Family 

In this section, we describe how to estimate the distribution of the vote total that C
will give to new data. We use class A as an explanatory example; the same
methodology applies to class B as well. We need to estimate three parameters,
namely the mean, variance, and distribution shape of the vote total.

From the last section, we know that C is composed of $C_1, C_2, \ldots, C_d$. A record
is unseen to $C_i$ if it is not included in $A_i$ or $B_i$. The way we generated $C_i$
leaves some records of the training data unseen to $C_i$. That is to say, if
we analyze how $C_i$ performs on these unseen data, we are able to predict the
performance of $C_i$ on new data. Applying the same argument to all $C_i$ in C, the
aggregated performance is exactly the vote total prediction of C.

The vote count that $C_i$ gives to an unseen record k is denoted as v(i, k). Let
$X_i$ be the random variable representing the vote count given by $C_i$ for unseen
records. The set of unseen records for $C_i$ is $\bar{A}_i = A - A_i$. Thus, the mean and
variance of $X_i$ can be estimated by

(3.1) $\hat{\mu}_{X_i} = [1/|\bar{A}_i|] \sum_{k \in \bar{A}_i} v(i,k)$

(3.2) $\hat{\sigma}^2_{X_i} = [1/(|\bar{A}_i| - 1)] \sum_{k \in \bar{A}_i} [v(i,k) - \hat{\mu}_{X_i}]^2$

respectively.

Since $Z_A$ is the vote total of C, the mean value of $Z_A$ is estimated by

(3.3) $\hat{\mu}_{Z_A} = \sum_{i=1}^{d} \hat{\mu}_{X_i}$

For the covariance estimation between $X_i$ and $X_j$, we have two situations.
First, if the set $\bar{A}_{ij} = \bar{A}_i \cap \bar{A}_j$ is nonempty, then we can estimate the covariance
of $X_i$ and $X_j$ by

(3.4) $\hat{\sigma}_{X_i X_j} = [1/|\bar{A}_{ij}|] \sum_{k \in \bar{A}_{ij}} [v(i,k) - \hat{\mu}_{X_i}][v(j,k) - \hat{\mu}_{X_j}]$

If $\bar{A}_{ij}$ is empty, we estimate the covariance of $X_i$ and $X_j$ by a linear
approximation function which is constructed from the known covariance values
and the amount of intersection of $\bar{A}_{ij}$ [7].
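The estimators (3.1) to (3.4) can be sketched as below, assuming the votes of each family member on its unseen records are available. The function names and the dict-based representation (record id mapped to vote count) are hypothetical choices for this example; the empty-intersection case is left to the linear approximation of [7] and is only signalled with `None`.

```python
def mean_and_variance(votes):
    """Estimates (3.1) and (3.2) from the votes v(i, k) over the unseen records of C_i."""
    n = len(votes)
    if n < 2:
        raise ValueError("need at least two unseen records")
    mu = sum(votes) / n
    var = sum((v - mu) ** 2 for v in votes) / (n - 1)
    return mu, var


def covariance(votes_i, votes_j, mu_i, mu_j):
    """Estimate (3.4) over the records unseen by both C_i and C_j.

    votes_i, votes_j: dicts mapping record id -> vote count.
    Returns None when no record is unseen by both members.
    """
    shared = votes_i.keys() & votes_j.keys()
    if not shared:
        return None
    return sum((votes_i[k] - mu_i) * (votes_j[k] - mu_j) for k in shared) / len(shared)


def vote_total_mean(member_means):
    """Estimate (3.3): the mean of Z_A is the sum of the member means."""
    return sum(member_means)
```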

We estimate the distribution shape of $Z_A$ by first estimating a
smaller distribution, that of the total unseen vote of each record, and then scaling
this distribution up to the mean and variance of $Z_A$. The mathematical details
are described in [7]. Applying the above methods to both training data
classes A and B, we obtain two estimated distributions of how the classification
family will vote for new data.

Based on $C$, a family $\mathcal{D}$ of decision schemes $D_z$ is generated, where $z$ ranges
over the possible vote totals of $C$. The decision scheme $D_z$ declares a
record to be in $A$ if the vote total produced by $C$ for that record is greater than or
equal to $z$, and declares the record to be in $B$ otherwise. The scheme $D_z$ classifies
a record of $A - \bar{A}$ correctly if the vote total is greater than or equal to $z$, and thus does so
with probability $P(Z_A \geq z)$. Conversely, misclassification by $D_z$ of a record of
$A - \bar{A}$, and thus a type A error, occurs with probability $\alpha = P(Z_A < z)$. Analogous
results hold for $B$. Clearly, if we know the distributions of $Z_A$ on $A - \bar{A}$ and of
$Z_B$ on $B - \bar{B}$, we will have the $\alpha$ and $\beta$ for each $D_z$.
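
A small sketch of how the two error values follow from the vote total distributions. It assumes the distributions are available as plain lists of sampled vote totals (a simplification; the text estimates them analytically), and it takes $\beta = P(Z_B \geq z)$ by analogy with the definition of $\alpha$. The function names are hypothetical.

```python
def error_rates(z, votes_a, votes_b):
    """Error values of decision scheme D_z from sampled vote totals.

    alpha: share of class-A vote totals below z (A record declared to be in B).
    beta:  share of class-B vote totals at or above z (B record declared to be in A).
    """
    alpha = sum(1 for v in votes_a if v < z) / len(votes_a)
    beta = sum(1 for v in votes_b if v >= z) / len(votes_b)
    return alpha, beta


def error_curve(thresholds, votes_a, votes_b):
    """Tabulate (z, alpha, beta) for each candidate threshold z."""
    return [(z, *error_rates(z, votes_a, votes_b)) for z in thresholds]
```

Raising the threshold trades type A errors against type B errors, which is the trade-off a user would inspect before fixing a decision threshold.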

One can define a decision function of $D_z$ based on the two error values,
namely $\alpha$ and $\beta$. The function can be used to control the classification error
according to different requirements [7].


4 Experiments

We have implemented the above approaches in a classification system with
decision support called Lsquare. This system takes the input training data, constructs
a classification family $C$, estimates the vote total distributions on the two classes,
and provides the decision function. The user can choose the decision threshold
in the decision function to control the misclassification error.
Several well-known datasets from the Repository of Machine Learning
Databases and Domain Theories of the University of California at Irvine [6]
have been tested with Lsquare. We show the resulting graphs for the Australian
Credit Card problem in (5.1). The data were made available by J. R. Quinlan.
They represent 690 MasterCard applicants, of which 307 are declared as positive
and 383 as negative. We declare $A$ (resp. $B$) to be the set of negative (resp.
positive) records. We obtain from $A$ and $B$ randomly selected subsets $\bar{A}$ and $\bar{B}$,
each containing 50% of the respective source set. We apply Lsquare to $\bar{A}$ and
$\bar{B}$, obtain the family $C$ of classification methods, and compute the estimated
error probabilities $\hat{\alpha}$ and $\hat{\beta}$. Then we apply $C$ to $A - \bar{A}$ and $B - \bar{B}$ to verify the
error probabilities. The graphs below show the results. The curves plotted with
diamonds are the estimated $\hat{\alpha}$ and $\hat{\beta}$, while the curves plotted with crosses are
the verified values.


(5.1) [Figure: Australian Credit Card. Estimated and verified $\alpha$ and $\beta$ plotted against the decision parameter $z$.]


5 Conclusions and Future Research

We have developed new strategies and techniques to predict the performance of
two-class classification systems. We predict a system's performance using the same data
that were used to train it. In two-class classification, the predicted error
distributions can be used to control classification errors when new data are classified.
Predicting performance from the training data lets the learning system learn more,
since no portion of the data has to be held out for evaluation. The performance prediction is
based on a system that has learned all the given data; hence it is representative
of future performance.

The approaches have been implemented within a learning system, Lsquare.
It was tested on many well-known datasets from the machine learning community.
The tests show our performance prediction mechanism to be very reliable. We plan to
test this scheme with different classification methods such as decision trees,
neural networks, and nearest-neighbor algorithms. This will in turn prompt
different analyses of performance prediction for different classifiers.


Abstract. The “Ocean SAmpling MObile Network” (SAMON) Project is a
simulation testbed for Web-based interaction among oceanographers and
simulation-based design of ocean sampling missions. In this paper, the current
implementation of SAMON is presented, along with a formal model based on
process algebra. Flexible optimization handles planning, mobility, evolution,
and learning. A generic behavior message-passing language is developed for
communication and knowledge representation among heterogeneous
Autonomous Undersea Vehicles (AUVs). The process algebra subsumed in this
language expresses a generalized optimization framework that contains genetic
algorithms and neural networks as limiting cases.


1  Introduction 

The  global  behavior  of  a  group  of  interacting  agents  goes  beyond  juxtaposition  of 
local  behaviors.  Wegner  [15]  indicates  that  interaction  machines,  formed  by  multiple 
agents,  have  richer  behavior  than  Turing  machines.  Milner  [7]  indicates  sequential 
processes  cannot  always  represent  concurrent  interactive  ones.  Realistic  applications 
of  autonomous  agents  require  new  models  and  theories.  Three  fundamental  questions 
remain  open: 

•  How  to  produce  intelligent  global  results  from  group  local  behaviors? 

•  How  to  decompose  problems  for  solution  by  independent  individual  agents? 

•  How  to  integrate  reactive  and  deliberative  behaviors? 

Our application uses process algebra and resource-bounded computation to solve
these problems and plan mobile underwater robot group missions. In this paper, we
describe a flexible optimization methodology for agent control and evolution. The
optimization model unifies genetic algorithms and neural networks in a manner suited
to reacting to changing dynamic environments with constrained resources.


2  SAMON  Underwater  Mobile  Robot  Testbed 

ARL's  SAMON  testbed  builds  upon  the  ARL  Information  Science  and  Technology 
Division’s  existing  AUV  technology.  The  ONR  SAMON  project  [11], [12]  studies 
networks  of  Autonomous  Underwater  Vehicles  (AUVs)  for  adaptive  ocean  sampling. 
It  contains  a  Web-based  testbed  for  distributed  simulation  of  heterogeneous  AUV 
missions,  and  advances  adaptive  autonomous  agent  design.  A  group  of  AUV’s 
attempts missions in hazardous environments. The group is organized in a four-level
hierarchy (see Fig. 1).


[Fig. 1: four-level hierarchy of Tactical Coordinator (TC), Supervising Autonomous
Underwater Vehicle (SAUV), Autonomous Underwater Vehicle (AUV), and Fixed
Sensor Packages (FSPs)]

A Tactical Coordinator initiates missions by transmitting orders to several
Supervising Autonomous Underwater Vehicles (SAUVs). Each SAUV uses sonar to
spontaneously form a group of subordinate AUVs. Each AUV collects data from
Fixed Sensor Packages (FSPs) distributed throughout the region. These data are relayed
to the commanding SAUV and Tactical Coordinator. SAUVs and AUVs all have
identical controllers, which respond to continuous sensor inputs with discrete decisions.
It is a typical sense-plan-act system. ARL's AUV controller combines fuzzy logic
with artificial neural networks as described in [13]. Signal processing routines use
sensor inputs to estimate physical variables. Tasks are sequences of behaviors, which
are sequences of atomic actions. Goal Achievement Functions (GAF) monitor system
progress. The sequence of behaviors is flexible: new elements are inserted as
required. The testbed allows remote access and integrates remote heterogeneous
simulators. A Geographic Information System (GIS), ARCINFO, supports the Tactical
Coordinator.


3 Process Algebra Model for Adaptive Autonomous Agents

Expressing  and  formulating  emerging  behavior  requires  a  rigorous  formal  model,  with 
the  following  characteristics: 

•  Agents  are  autonomous. 

•  Agents  communicate  asynchronously  using  message-passing. 

•  Agents  are  encapsulated. 

•  Agents  can  be  heterogeneous. 

•  Agents  communicate  with  a  finite  number  of  neighbors. 

•  Group  reconfiguration,  such  as  link  and  node  migration,  is  possible. 

•  Groups  exhibit  complex  behavior  due  to  interaction  among  agents. 

•  Agents  and  groups  adapt  to  bounded  resources. 

Appropriate formal models for autonomous agent design are π-calculus [7],
interaction  machines  [15],  cellular  automata,  and  automata  networks  [4].  None  adapt 
to  bounded  resources.  Our  model  does,  and  it  is  as  expressive  as  any  other  model. 

Resource  bounded  computation  is  known  under  a  variety  of  names,  including 
anytime  algorithms  [16].  It  trades  off  result  quality  for  time  or  memory  used  to 
generate  results.  It  is  characterized  by: 

•  Algorithm  construction  to  search  for  bounded  optimal  answers. 

•  Performance  measure  and  prediction. 

•  Composability. 

•  Meta-control. 

We  use  a  process  algebra  variant  of  resource-bounded  computation  to  integrate 
deliberative  and  reactive  approaches  for  action  selection  in  real  time.  Our  approach 
has  been  developed  independently  of  anytime  algorithms  under  the  names  modifiable 
algorithms  [2],  and  $-calculus  [3]. 

$-calculus proposes a general theory of algorithm construction. Everything is a
$-expression: agents, behaviors, interactions, and environments. Elementary behaviors
are $-expressions representing atomic process steps. Simple $-expressions consist of
negation ¬, cost $, send →, receive ←, mutation ↯, and user-defined functions. More
complex actions combine $-expressions using sequential composition ∘, parallel
composition ∥, general choice ⊔, cost choice ∪, and recursive definition :=.
$-expressions use prefix notation similar to Lisp. Each $-expression has an associated
cost value. Data, functions, and meta-code are written as (f X), where f is a name and
X = (x_1, x_2, ...) is a possibly countably infinite vector of parameters. $-expression
syntax is summarized below. Let P denote compound $-expressions and α simple
$-expressions:


α ::= (¬ α)             negation
    | ($ P)             cost
    | (→ (a Q))         send
    | (← (a X))         receive
    | (↯ (a Q))         mutation
    | (a Q)             user-defined simple $-expression

P ::= (∘ i∈I P_i)       sequential composition
    | (∥ i∈I P_i)       parallel composition
    | (∪ i∈I P_i)       cost choice
    | (⊔ i∈I P_i)       general choice
    | (f Q)             user-defined $-expression
    | (:= (f X) P)      recursive definition


I is a possibly countably infinite indexing set. We write empty parallel composition,
general choice, and cost choice as ⊥, and empty sequential composition as ε. ⊥ expresses
logical false, and ε masks parts of $-expressions. Sequential composition is used when
$-expressions run in order, and parallel composition when they run in parallel. Cost
choice expresses optimization, i.e., it selects the cheapest alternative. General choice is
used when we are not interested in optimization. Call and definition, such as
procedure and function definitions, specify recursion or iteration. This approach can
describe all current heuristic methods, and provide a framework for choosing between
heuristics.

Meta-control  is  a  simple  algorithm  that  attempts  to  minimize  cost.  Solution 
quality  improves  if  time  is  available.  Performance  measures  are  cost  functions,  which 
represent  uncertainty,  time,  or  available  resources.  Crisp,  probabilistic,  and  fuzzy- 
logic  cost  functions  are  part  of  the  calculus.  Users  may  define  their  own. 
Incomplete/uncertain  information  takes  the  form  of  invisible  expressions  whose  cost 
is  either  unknown  or  estimated.  Meta-control  can  choose  between  local  search  and 
global  search.  Global  search  methods,  like  genetic  algorithms,  process  multiple  points 
in  the  search  space  in  parallel. 

Scalability and composability are achieved by building expressions from
subexpressions.  Recursive  definitions  decompose  expressions  into  atomic 
subexpressions.  Composability  of  cost  measures  is  assumed.  Expression  costs  are 
functions  of  subexpression  costs.  Deliberation  occurs  in  the  form  of  select-examine- 
execute  cycles  corresponding  to  sense-deliberate-act.  An  empty  examine  phase 
produces  a  reactive  algorithm. 
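
The composability of costs and the role of cost choice can be illustrated with a short Python sketch. The cost rules below (sequential composition adds costs, parallel composition takes the maximum, cost choice takes the minimum, an empty choice costs infinity) are illustrative assumptions only; the text requires just that expression costs be functions of subexpression costs. The nested-tuple encoding and the operator names "seq", "par", and "choice" are hypothetical.

```python
def cost(expr):
    """Cost of an expression given as a number (atomic) or a nested tuple.

    Assumed rules: "seq" sums subexpression costs, "par" takes the largest,
    "choice" (cost choice) takes the cheapest alternative.
    """
    if isinstance(expr, (int, float)):
        return float(expr)  # atomic expression with a known cost
    op, *args = expr
    costs = [cost(a) for a in args]
    if op == "seq":
        return sum(costs)  # empty sequence costs 0.0
    if op == "par":
        return max(costs, default=0.0)
    if op == "choice":
        return min(costs, default=float("inf"))  # empty choice: nothing to select
    raise ValueError(f"unknown operator: {op!r}")


def cheapest(alternatives):
    """Select the alternative that cost choice would pick."""
    return min(alternatives, key=cost)
```

For example, `cost(("choice", ("seq", 2, 3), ("par", 4, 1), 6))` compares a sequence costing 5, a parallel block costing 4, and an atomic step costing 6, and returns the cheapest of the three.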

Short  (long)  deliberation  is  natural  for  interruptible  (contract)  algorithms. 
Interruptible  algorithms  can  be  interrupted  down  to  the  level  of  atomic  expressions. 
Interruptibility  is  controlled  at  two  levels:  choice  of  atomic  expressions  and  the  length 
of  the  deliberation  phase.  Contract  algorithms,  although  capable  of  producing  results 
whose  quality  varies  with  time  allocation,  must  be  given  an  agreed  upon  time 
allocation  to  produce  results. 

At the meta-level, execution is monitored and modified to minimize cost using
"k-Ω optimization." Solutions are found incrementally. They may optimize any of
several factors. Depending on problem complexity, cost function volatility, and level
of uncertainty, deliberation can be done for k = 0, 1, 2, ... steps or until termination.
Optimization is limited to alphabet Ω, a subset of the complete expression alphabet.
This increases run-time flexibility.

We define an adaptive agent model as a parallel composition of component agents:

(∥ i A_i)

where (:= (A_i) MA_i) defines agent i with meta-system control MA_i. Agent MA_0 is the
environment. Each agent MA_i, i > 0, has a finite neighborhood it communicates with,
and the $-expression:

(:= MA_i (∘ (init P_i0) (loop P_i))),

where loop is the select-examine-execute cycle performing optimization until the
goal is satisfied, at which point the agent re-initializes and works on a new goal:

(:= (loop P_i) (⊔ (∘ (¬ goal P_i) (sel P_i) (exam P_i) (exec P_i) (loop P_i))
                  (∘ (goal P_i) MA_i)))

This general model is an instance of resource-bounded computation, based on process
algebras. It covers a wide class of autonomous agents, including SAMON AUVs.
The graph and nodes can be arbitrary. We only require that the nodes “understand”
the messages in the network. The environment is modeled as a $-expression, which
can be non-deterministic or stochastic (assuming incomplete knowledge of the
environment). A distributed environment, if needed, can be modeled as a subnetwork
instead of a single node.

The SAMON network topology combines a four-level tree (starting from the root:
TC, SAUVs, AUVs, FSPs) and a star topology with the environment as a central node
connected to all remaining nodes. All nodes communicate by message-passing
through sensors and effectors. The input and output messages consist of orders (sonar
or radio), reports (data or status), and sensory data from and to the environment. In the
distributed SAMON testbed, messages take the form of TCP/IP socket
communication.

The hierarchical structure hides complexity, improves reliability, and increases
adaptation  and  execution  speed.  However,  an  optimal  tree  structure  must  be  derived 
for  a  specific  task.  To  find  an  acceptable  tree,  the  architecture  should  evolve.  For 
complicated  tasks,  a  strict  hierarchy  may  not  suffice.  For  example,  multiple  robots 
pulling  a  heavy  object  must  communicate  with  peer  nodes  to  be  successful. 
Temporary  mobile  links  can  do  this.  Cooperating  agents  need  performance  metrics 
with  feedback  to  achieve  their  objectives. 


4  Generic  Behavior  Message-Passing  Language 

SAMON allows vehicles from collaborating institutions to communicate and
cooperate. However, existing AUVs (e.g., NPS Phoenix [1], FAU Ocean Explorer
[12]) employ incompatible designs. It is too early to enforce a single standard AUV
design. On the other hand, AUVs designed to perform similar missions should be able
to cooperate. A unique aspect of SAMON is collaboration among heterogeneous
AUVs. For this purpose, we propose a common communications language: the Generic
Behavior Message-Passing Language.

Our collaborative infrastructure (Fig. 2) is a network of cooperating agents
communicating by send and receive primitives. $-calculus provides the
communications framework, and groups elementary behaviors into complex
behaviors (missions), or scenarios (programs). Missions are programs of elementary
behaviors. The amount of decomposition supported depends on the AUV
implementation.

Each  node  is  described  by  a  cost  expression,  and  implemented  as  an  autonomous 
unit.  Nodes  can  have  heterogeneous  architectures,  but  they  share  a  generic  behavior 
message-passing  language  to  interact  and  cooperate.  The  nodes  can  be  real  or 
simulated vehicles, for instance FAU, NPS, or ARL PSU AUVs, TCs (situated on
land,  air,  or  sea),  SAUVs,  environment,  and  base  recovery  vehicle.  Nodes  can  be 
connected  by  an  arbitrary  topology,  which  we  depict  as  a  bus.  Nodes  communicate 
using  their  own  message  formats.  Wrappers  will  convert  messages  to  the  generic 
language  specification.  Future  controllers  could  produce  messages  directly  in  the 
standard  form.  Wrappers  should  be  eliminated  as  a  long  term  goal. 

Generic behavior message-passing language requirements include:

•  Modularization  -  each  node  should  be  independent,  easy  to  replace  and  modify; 

•  Autonomy  -  objects  communicate  only  by  message-passing; 

•  Flexibility  -  allows  arbitrary  execution  of  behaviors; 

•  Extensible  -  for  new  types  of  vehicles,  or  new  environments; 

•  Evolving - mechanisms for adaptation and optimization;

•  Research  oriented  -  communication  between  real  and  simulated  vehicles; 

•  Simple  -  but  relatively  complete; 

•  Programmable  -  messages  can  be  added,  but  may  not  be  accepted  by  all. 


Fig. 2 SAMON testbed collaborative mission execution infrastructure
(nodes shown: ARL PSU TC, FAU Ocean Explorer, ARL PSU SAUV/AUV, NPS Phoenix)

To satisfy these requirements, the language syntax is based on $-calculus. All
behaviors are functions, transmitted between nodes using send and receive primitives.
A set of predefined generic-behavior functions forms a library on the node. Users can
define new behaviors, which may not be understood by other controllers. Our
definition does not specify function implementation: functions are black boxes. The
language has two behavior types:

•  elementary  behaviors,  low  level  communication  between  entities,  and 

•  agglomerate  behaviors,  group  behaviors  by  scripting  programs. 

Send and receive primitives provide message-passing communication and
synchronization between agents. A matching send and receive use the same
communication channel name. Channels can be sonar, radio, satellite, etc. If a
matching channel name is not found, the operation blocks. Parallel composition of a
send with the rest of the program models asynchronous communication. A set of
elementary behaviors has been formulated for undersea applications of hierarchical
AUV networks.
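
The blocking behavior of send and receive, and the use of parallel composition to obtain asynchronous communication, can be sketched with Python threads. This is an illustrative analogy only, not the testbed implementation (which, per the text, uses TCP/IP sockets): the `Channels` class, its method names, and the example message are all hypothetical.

```python
import queue
import threading


class Channels:
    """Named channels on which send and receive block until they are matched."""

    def __init__(self):
        self._queues = {}
        self._lock = threading.Lock()

    def _channel(self, name):
        # Create the queue for a channel name on first use.
        with self._lock:
            return self._queues.setdefault(name, queue.Queue())

    def send(self, name, message):
        """Block until a receive on the same channel name has taken the message."""
        q = self._channel(name)
        q.put(message)
        q.join()  # returns once the receiver has called task_done()

    def receive(self, name):
        """Block until a send on the same channel name supplies a message."""
        q = self._channel(name)
        message = q.get()
        q.task_done()
        return message

    def send_async(self, name, message):
        """Run send in parallel with the caller, modeling asynchronous communication."""
        t = threading.Thread(target=self.send, args=(name, message), daemon=True)
        t.start()
        return t
```

With `send_async`, the sender continues with the rest of its program while the message waits for a receiver on the matching channel; a plain `send` would instead hold the sender until the message is received.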


5  Optimization  and  Adaptation 

Adaptation  occurs  at  all  levels  in  the  hierarchy.  Optimization  is  limited.  Individual 
AUV's,  and  the  system  as  a  whole,  make  decisions  based  on  incomplete  information 
about  dynamic  processes.  Time  is  not  available  to  compute  strictly  optimal  solutions 
on-line.  Instead,  we  compute  a  satisficing  solution  that  provides  a  "good  enough" 
answer;  the  best  that  can  be  found  given  current  resources  and  constraints.  The 
testbed  uses  multiple  heterogeneous  AUV’s  to  collect  data  from  an  arbitrary  undersea 
environment.  Work  takes  place  in  hostile  environments  with  noise-corrupted 
communication.  Components  are  prone  to  destruction  or  failure.  Under  these 
conditions,  efficient  operation  can  not  rely  on  static  plans.  Each  AUV  is  self- 
contained  and  makes  as  many  local  decisions  as  possible.  Operational  information 
and  sensor  data  travel  in  both  directions  in  the  hierarchy.  The  final  two  points  imply 
informal  activities  that  are  difficult  to  implement  in  automated  systems. 

An  AUV  at  any  level  in  the  hierarchy,  when  it  receives  a  command,  must  choose 
from  a  number  of  strategies.  Decisions  are  made  by  evaluating  $-functions  for  the 
behaviors  defining  a  strategy.  Current  network  and  AUV  states  are  used  as  data  by  the 
$-function.  $-functions  can  be  derived  in  a  number  of  ways.  Much  of  computer 
science  is  based  on  deductively  deriving  asymptotic  measures  of  algorithm 
computational  complexity  based  on  characteristics  of  the  data  input.  They  provide 
order  of  magnitude  equations  for  best,  worst,  or  possibly  average  algorithm 
performance  based  on  input  volume.  Constants  are  irrelevant  in  asymptotic  measures, 
since  at  some  point  the  value  of  a  higher  order  factor  will  be  greater  than  a  lower 
order  one;  no  matter  what  constants  are  used.  These  measures  are  useful  for 
determining  algorithm  scalability,  but  inadequate  for  deciding  between  two  specific 
alternatives  where  constant  factors  are  relevant.  In  addition,  average  complexity 
measures  are  generally  based  on  questionable  assumptions  concerning  the  statistical 
distribution of input data, such as assuming all inputs are equally likely.
Computational complexity is almost irrelevant to NP-complete problems.
However,  computational  complexity  does  provide  a  starting  point  for  defining  $- 
functions.  In  some  cases,  deduction  alone  can  provide  useful  functions.  In  other  cases, 
especially  when  noise  is  an  important  factor,  deduction  is  insufficient.  If  that  is  so, 
empirical  testing  can  be  used.  Testing  is  often  simulation,  where  a  number  of  runs  are 
replicated  with  controllable  factors  set  to  fixed  values  and  uncontrollable  factors 
given  random  values.  Results  from  a  large  number  of  tests  provide  data  points,  which 
can  be  used  to  derive  functional  approximations.  Derivation  of  functional 
approximation  can  be  done  using  a  number  of  approaches,  including  statistical 
regression  [8],  rough  sets  [5],  [14],  and  visualization  [6],  [9].  The  $-functions  found 
should  then  be  tested  to  verify  their  ability  to  approximate  the  desired  quality 
measures.  Tests  could  involve  either  simulations  or  preferably  physical  experiments 
for  AUV's. 

The AUVs evaluate $-functions using values for the relevant factors, which
express the current physical environment. Two natural limits exist to this approach:
not all relevant factors can always be known with sufficient certainty, and the
physical environment is subject to change. For that reason, we limit our optimization,
performing what we call k-Ω optimization. The variable k refers to the limited
horizon for optimization, necessary due to the unpredictable dynamic nature of the
environment. The variable Ω refers to a reduced alphabet of information. No AUV
ever has reliable information about all factors that influence all AUVs participating in
a mission. To compensate for this, we mask from consideration those factors for which
information is not available, reducing the alphabet of variables used by the $-function. This
can be done by substituting a constant value or a default function for the masked
variables.

This approach allows each AUV to choose between strategies and accomplish its
mission. $-functions provide a metric for comparing alternatives. By using k-Ω
optimization to find the strategy with the lowest $-function value, the AUV finds a
satisficing solution. This avoids wasting time trying to optimize behavior beyond the
foreseeable future. It also limits consideration to those issues where relevant
information is available. This approach, using local optimizations to find globally
acceptable satisficing solutions, can be generalized to other genetic approaches.
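
The selection step described above can be sketched as follows. This is a minimal illustration under assumptions, not the SAMON controller: a strategy is taken to be a list of per-behavior cost functions, the horizon k limits how many behaviors are costed, and the reduced alphabet Ω is a set of variable names whose current values are trusted; every other variable is masked by a constant default. All names are hypothetical.

```python
def k_omega_cost(strategy, state, k, omega, defaults):
    """Cost of the first k behaviors of a strategy over the reduced alphabet.

    Variables outside `omega` (or missing from `state`) are masked by
    substituting their constant default value.
    """
    visible = {
        name: state[name] if name in omega and name in state else default
        for name, default in defaults.items()
    }
    # Only the first k behaviors fall inside the optimization horizon.
    return sum(step_cost(visible) for step_cost in strategy[:k])


def choose_strategy(strategies, state, k, omega, defaults):
    """Return the name of the strategy with the lowest k-omega cost."""
    return min(
        strategies,
        key=lambda name: k_omega_cost(strategies[name], state, k, omega, defaults),
    )
```

Changing either k or Ω can change which strategy is selected, which is how limited horizons and masked variables shape the satisficing choice.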


6  Conclusions 

As  part  of  an  ONR  program,  a  testbed  is  being  established  for  combining 
heterogeneous  AUV's  for  oceanographic  sampling.  A  language  describing  generic 
AUV  behaviors  will  be  used  to  communicate  between  vehicles  designed  by 
independent  research  groups.  Part  of  the  language  is  a  process  algebra,  which  uses 
evolutionary  primitives  like  mutation.  The  process  algebra  provides  a  framework  for 
limited  optimization.  This  optimization  contains  genetic  algorithms  and  neural 
networks  as  limiting  cases.  Limiting  the  optimization  allows  it  to  be  performed  in 
real-time. Currently, work is underway to implement the Generic Behavior
Message-Passing Language, to experiment with cooperation and emerging behavior
using resource-bounded optimization, and to integrate a Virtual Environment for
nowcasting and forecasting based on models and data from Harvard and Rutgers
Universities.

Acknowledgments.  We  would  like  to  recognize  the  contributions  of  multiple  co¬ 
workers  to  this  project.  They  include  Dr.  J.  Stover,  Dr.  E.  Peluso,  Dr.  P.  Stadter,  Marc 
Lattorela,  Jen  Ryland,  and  others  too  numerous  to  mention. 

References 

1. Brutzman D., et al.: The Phoenix Autonomous Underwater Vehicle, in AI-Based
Mobile Robots, MIT/AAAI Press, (1998)

2. Eberbach E.: SEMAL: A Cost Language Based on the Calculus of Self-Modifiable
Algorithms, International Journal of Software Engineering and Knowledge
Engineering, vol. 4, no. 3, (1994) 391-400

3. Eberbach E.: Enhancing Genetic Programming by $-calculus, Proc. of the Second
Annual Genetic Programming Conference GP-97, Morgan Kaufmann, (1997), 88 (a
complete version in Proc. of the Tenth Australian Joint Conf. on AI, AI'97, The ACS
Nat. Committee on AI and ES, Perth, Australia, (1997), 77-83)

4. Garzon M.: Models of Massive Parallelism: Analysis of Cellular Automata and
Neural Networks, Springer-Verlag, (1995)

5. Guan J.W., Bell D.A.: Rough computational methods for information systems,
Artificial Intelligence, vol. 105, 77-103

6. Keim D.A., Kriegel H.-P.: Visualization Techniques for Mining Large Databases:
A Comparison, IEEE Transactions on Knowledge and Data Engineering, vol. 8, no. 6,
(Dec. 1996) 923-938

7. Milner R., Parrow J., Walker D.: A Calculus of Mobile Processes, I & II,
Information and Computation 100, (1992) 1-77

8. Montgomery D.C.: Design and Analysis of Experiments, Wiley, New York, NY

9. Nielson G.M., Hagen H., Mueller H.: Scientific Visualization, IEEE Computer
Society, Washington, DC, (1997)

10. Phoha S. et al.: A Mobile Distributed Network of Autonomous Undersea
Vehicles, Proc. of the 24th Annual Symposium and Exhibition of the Association for
Unmanned Vehicle Systems International, Baltimore, MD, (1997)

11. Phoha S. et al.: Ocean Sampling Mobile Network Controller, Sea Technology,
(Dec. 1997) 53-58

12. Smith S. et al.: The Ocean Explorer AUV: A Modular Platform for Coastal
Oceanography, Proc. of the 9th Intern. Symp. on Unmanned Untethered Submersible
Technology, (1995) 67-75

13. Stover J.A., Hall D.L.: Fuzzy-Logic Architecture for Autonomous Multisensor
Data Fusion, IEEE Transactions on Industrial Electronics, vol. 43, no. 3, (Jun. 1996)
403-410

14. Tsaptsinos D.: Rough Sets and ID3 Rule Learning: Tutorial and Application to
Hepatitis Data, Journal of Intelligent Systems, vol. 8, no. 1-2, (1998) 203-223

15. Wegner P.: Why Interaction is More Powerful Than Algorithms, CACM, vol. 40,
no. 5, (May 1997) 81-91

16. Zilberstein S.: Operational Rationality through Compilation of Anytime
Algorithms, Ph.D. Dissertation, Dept. of Computer Science, Univ. of California at
Berkeley, (1993)


Ontology-Based  Multi-agent  Model  of  an 
Information  Security  System 


V.I. Gorodetski¹, L.J. Popyack², I.V. Kotenko³, V.A. Skormin⁴


1 - St. Petersburg Institute for Informatics and Automation. E-mail: gor@mail.iias.spb.su
2 - USAF Research Laboratory, Information Technology Division. E-mail: popyack@rl.af.mil
3 - St. Petersburg Signal University. E-mail: ivkote@robotek.ru
4 - Binghamton University. E-mail: vskormin@binghamton.edu


Abstract. The paper is focused on a distributed agent-based information security
system of a computer network. A multi-agent model of an information security
system is proposed. It is based on the established ontology of the information
security system domain. Ontology is used as a means of structuring distributed
knowledge utilized by the information security system, as the common
ground of interacting agents, and for agent behavior coordination.

Keywords:  multi-agent  system,  information  security,  ontology. 


1  Introduction 

Existing computer security systems consist of a number of independent components
that require an enormous amount of distributed and specialized knowledge to solve
their specific security sub-problems. Often, these systems constitute a bottleneck
for the throughput, reliability, flexibility and modularity of the computational
process. A modern information security system (ISS) should be considered as a
number of independent, largely autonomous, network-based, specialized software
agents operating in a coordinated and cooperative fashion, designated to prevent
particular kinds of threats and to suppress specific types of attacks. Modern
multi-agent system technology presents a valuable approach for the development of
an ISS that, when implemented in a distributed large-scale multi-purpose
information system, is expected to have important advantages over existing
computer security technologies. An ontology-based multi-agent model of an ISS is
considered herein. In Section 2, the conceptual level of an ISS model is outlined.
In Section 3, we propose the ontology of an information security domain. The
topology of task-oriented distributed agents' knowledge and beliefs, providing a
common ground for agent information exchange and utilized for agent behavior
coordination and mutual understanding, is considered. In Section 4, we outline the
ISS architecture and general principles of agents' negotiation and coordination
within an agent-based ISS. In Section 5, the modeling approach of an ISS is
described. Section 6 contains a brief analysis of relevant research associated
with agent-based ISS. In the conclusion, we outline the main results and future
work aimed at utilizing agent-based technology for ISS development.

2 Conceptual Agent-Based Model of ISS

Conceptually, a multi-agent ISS is viewed as a cooperative multitude of the
following types of agents, distributed both across the network and on the host
itself. (1) Access control agents that constrain access to information according
to the legal rights of particular users by enforcing discretionary access control
rules (ACR), which specify for each "subject - object" pair the authorized kinds
of messages. Various access control agents cooperate to maintain compliance with
discretionary ACR on various sites of the network. These agents supervise the
flows of confidential information by enforcing mandatory ACR that do not admit
interception of confidential information. (2) Audit and intrusion detection agents
that detect non-authorized access and alert the responsible system (agent) about
the potential occurrence of a security violation. As a result of statistical
processing of the messages formed in the information system, these agents can stop
data processing, inform the security manager, and specify the discretionary ACR. A
statistical learning process, crucial for the successful operation of these
agents, is implemented. It utilizes available information about normal system
operation, possible anomalies, non-authorized access channels and probable scripts
of attacks. (3) Anti-intrusion agents responsible for pursuing, identifying and
rendering harmless the attacker. (4) Diagnostic and information recovery agents
that assess the damage caused by non-authorized access. (5) Cryptographic,
steganography and steganoanalysis agents that provide safe data exchange channels
between the computer network sites. (6) Authentication agents responsible for
identifying the source of information and for verifying whether its security was
maintained during data transmission, which provides identity verification. They
assure the conformity between the functional processes implemented and the
subjects initiated by these processes. On receiving a message from a functional
process, these agents determine the identifier of the subject for this process and
transfer it to the access control agents for enforcement of discretionary ACR.
(7) Meta-agents that manage the information security processes, provide
coordinated and cooperative behavior of the above agents, and assure the required
level of general security according to global criteria.
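
The discretionary ACR described under (1) can be illustrated with a minimal sketch. It assumes that the rules are stored as a table keyed by "subject - object" pairs, each listing the authorized kinds of messages. The names `ACR` and `is_authorized`, and the sample subjects and objects, are hypothetical and are not taken from the system described in this paper.

```python
# Hypothetical discretionary access control rules (ACR):
# each (subject, object) pair maps to the set of authorized message kinds.
ACR = {
    ("alice", "payroll_db"): {"read"},
    ("bob", "payroll_db"): {"read", "write"},
}


def is_authorized(subject, obj, message_kind):
    """Return True if the rule table authorizes this kind of message."""
    # Pairs with no rule at all are denied by default (an assumption).
    return message_kind in ACR.get((subject, obj), set())
```

In such a scheme an authentication agent would supply the subject identifier, and an access control agent would then call something like `is_authorized` before a message is passed on.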


3  Ontology  of  Information  Security  Domain 

Agents of a multi-agent ISS, performing the global information security task in a
distributed and cooperative fashion, must communicate by exchanging messages.
Message exchange requires that the agents are able to "understand" each other in
some sense. Mutual agent understanding implies that each agent "knows" (1) what
kind of task it must and is able to execute, (2) what agent(s) it has to address
when requesting help if its functionality and/or available information are not
sufficient for dealing with a problem within its scope of responsibility, and
(3) what forms and terms of message representation are understood by the
addressee. Therefore, each agent must possess its own model and models of other
agents.

One of the most promising approaches to modeling the distributed agents' knowledge
and beliefs that constitute the common ground of an entire multi-agent system is
the utilization of a domain ontology [3]. As for any other domain, the ontology of
the information security domain is a description of the partially ordered concepts
of this domain and the relationships over them that should be used by the agents.
This ontology describes, in a natural way, ontological commitments for a set of
agents so that they are able to communicate about a domain of discourse without
necessarily operating on a globally shared theory. In such an ontology,
definitions associate the names of entities in the space of discourse with
human-readable text describing the meanings of the names, and with formal axioms
that constrain the interpretation and well-established use of these terms [3]. A
part of the developed fragments of the information security domain ontology,
associated with the tasks of agents responsible for auditing, detecting
non-authorized access, and authentication, is depicted in Fig. 1.

Fig. 1. Fragment showing ontology of information security domain


4  ISS  Architecture  and  General  Principles  of  Agents'  Negotiation 

Consider a number of basic principles for constructing an ISS that functions as a
community of integrated agents distributed in a network environment and allocated
on several hosts. Each security agent should be host-based and operate on some
segment of the computer network. In this case, we assume that each meta-agent is
host-based as well. A meta-agent manages a set of the above-mentioned specialized
agents that, in turn, receive information from the agent "demons" investigating
the input traffic (login, password, etc.). The agent demons monitor the input
traffic to different servers located on the same host. In essence, they are
software sensors that form various metrics of input traffic. All agents are
expected to communicate, which enables the ISS to detect attacks on the network
when intrusion attempts are undertaken "locally" and/or serially, even when each
individual intrusion attempt cannot be interpreted as an intrusion. The offered
set of agents can reside on any host and can cooperate through the meta-agent,
which operates with the "top level" knowledge base and draws conclusions within
the framework of one host. The information interchange between hosts is carried
out either on a peer basis or by means of the meta-agent acting as the network
layer manager.
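
One way to picture how a meta-agent might recognize a serial attack from attempts that are individually harmless is sketched below. This is only an illustration under stated assumptions: the daemons are assumed to report a numeric suspicion score per traffic source, and the meta-agent is assumed to sum these scores. The function `correlate`, the constant `ALERT_THRESHOLD` and the score values are hypothetical and do not come from the paper.

```python
from collections import defaultdict

ALERT_THRESHOLD = 1.0  # hypothetical host-level alert threshold


def correlate(reports, threshold=ALERT_THRESHOLD):
    """Aggregate per-source suspicion scores reported by daemon agents.

    reports: iterable of (source, score) pairs, where each score alone
    may be too small to be interpreted as an intrusion.
    Returns the set of sources whose accumulated score reaches the threshold.
    """
    totals = defaultdict(float)
    for source, score in reports:
        totals[source] += score
    return {source for source, total in totals.items() if total >= threshold}
```

For example, three reports of 0.4 from the same source would each stay below the threshold, while their sum would trigger an alert at the meta-agent level.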

The subsets of nodes and relations of the ontology used by particular agents for
task solving are determined by the agents' functions. The nodes placed on the
intersections of ontology fragments, reflecting the functions of two individual
agents, constitute the shared knowledge jointly used by both agents in decision
making (see Fig. 2). Assume that in order to make a decision, agent 2 needs to
access knowledge of nodes 1 and 2. But agents 1 and 3 have more detailed knowledge
associated with these nodes. Therefore, agent 2 should receive this knowledge from
them. Similar situations will take place during the interaction of other agents.
It can be seen that agent 2 "is aware" only of nodes 1 and 2; it formulates its
request in the terms understood by agents 1 and 3, receives knowledge from them,
and is able to interpret it correctly.
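
The example with agents 1, 2 and 3 can be sketched as set intersections over ontology fragments. The sketch assumes that each agent's fragment is simply the set of ontology nodes it uses; the dictionary `fragments`, the node names beyond nodes 1 and 2, and both helper functions are hypothetical.

```python
# Hypothetical ontology fragments: each agent is described by the set of
# ontology nodes it uses for task solving.
fragments = {
    "agent1": {"node1", "node4", "node5"},
    "agent2": {"node1", "node2"},
    "agent3": {"node2", "node6"},
}


def shared_nodes(a, b):
    """Nodes in the intersection of two agents' ontology fragments."""
    return fragments[a] & fragments[b]


def providers_for(requester):
    """Map each node known to the requester to the other agents sharing it."""
    return {
        node: sorted(
            other for other in fragments
            if other != requester and node in fragments[other]
        )
        for node in sorted(fragments[requester])
    }
```

Here `providers_for("agent2")` tells agent 2 which agents to address for more detailed knowledge about each of its nodes.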


Fig. 2. Representation of agents' ontologies intersection


5 ISS Modeling Approach

To demonstrate the validity of our approach, we are in the process of developing a
modeling testbed of an ISS. The hardware part of the testbed includes a local
computer network that has access to the Internet. The software part is based on
Unix and Windows OS and a specialized agent-based program package that is being
developed in Java and Visual C++. At the first step of the testbed realization,
the attack intrusion detection environment was built. It includes facilities for
network attack modeling, simple agents investigating the input traffic, intrusion
detection agents and meta-agents.


6 Related Work

Many existing and proposed ISSs use a monolithic architecture. Several approaches
that exploit the idea of a distributed ISS are given in [2, 4, 6, 7]. Few papers,
for example [1, 5, 8, 9], consider an agent-based approach to information security
system design. Unfortunately, these papers (1) restrict themselves to solving only
the intrusion detection task, (2) do not pay the needed attention to the agent
cooperation problem and multi-agent system architecture, and (3) ignore the
advantages of using intelligent agents. Nevertheless, even such a relatively
simple agent-based approach to modeling an ISS leads to a number of advantages
such as efficiency, fault tolerance, resilience to subversion, scalability, etc.
In our approach we have borrowed this idea and aim to overcome all of these
shortcomings.


7  Conclusion  and  Future  Work 

In this paper, an ontology-based multi-agent model of an ISS is proposed. The main
results of the paper include: (1) development of an information security domain
ontology that is associated with the multitude of information security tasks under
consideration and that serves as the framework for developing and representing
distributed common knowledge and agents' individual knowledge; (2) development of
an agent-based architecture of an ISS that aims at solving the entire multitude of
problems related to particular tasks. In future work, we plan to develop the
domain ontology, the agent-based architecture and the formal frameworks for
representing distributed knowledge and beliefs in more detail. A further intention
is to exploit "learning by feedback" methods to provide the ISS with real-time
adaptation properties.


Abstract. Wavelet analysis, as a recent data filtering method (or multi-scale
decomposition), is particularly useful for describing signals with sharp, spiky,
discontinuous or fractal structure in financial markets.

This study investigates several wavelet thresholding criteria or techniques to
find the optimal one for supporting multi-signal decomposition methods, using the
daily Korean won / U.S. dollar currency market as a case study, especially for
financial forecasting with a neural network. The experimental results show that a
cross-validation technique is the best thresholding criterion among the existing
thresholding techniques for an integrated model of the wavelet transformation and
the neural network.

Key  words:  Discrete  Wavelet  Transform,  Wavelet  Packet  Transform,  Wavelet 
Thresholding  Techniques,  Neural  Networks,  Nonlinear  Dynamic  Analysis 


1  Introduction 

Traditionally, fluctuations in financial markets are treated as white noise.
However, this is not true once the trend is properly removed: we can clearly
observe some business cycles, though they evolve with time. The goal of
forecasting is to identify the pattern in the time series and use the pattern to
predict its future path.

The  issue  of  generalization  in  this  interpretation  becomes  one  of  how  to  extract 
useful  information  from  the  noise-contaminated  data,  and  to  rebuild  the  pattern  as 
closely  as  possible,  while  ignoring  the  useless  noises. 

In particular, joint time-frequency filtering techniques such as wavelet
transforms have been shown to be useful in estimating the coefficients of
forecasting models. The principal advantage of applying these filtering methods is
that they make it possible to isolate relevant frequencies.

During the last decade a new and very versatile technique, the wavelet transform
(WT), has been developed as a unifying framework for a number of independently
developed methods (Mallat [34], Daubechies [18]). Recently, literature on the
applications of wavelet analysis in financial markets has appeared (see Table 1).

One  of  the  most  important  problems  that  has  to  be  solved  with  the  application  of 
wavelet  filters  is  the  correct  choice  of  the  filter  type  and  the  filter  parameters.  The 
most  difficult  choice  is  that  of  the  cut-off  frequency  of  the  filter  which  has  to  be 
specified  either  explicitly  or  implicitly  (Mittermayr  et  al  [37]). 

This study is intended to explore the wavelet universal thresholding algorithms to
denoise data and to compare their performance, on the basis of the root mean
squared error (RMSE), with that of other commonly used smoothing filters in
financial forecasting.

We also evaluate the effectiveness of both transforms, the discrete wavelet
transform and the wavelet packet transform, on the daily Korean Won / US Dollar
exchange rate market.

The  remainder  of  this  study  is  organized  as  follows.  The  next  section  reviews  time- 
frequency  decomposition,  and  then  discrete  wavelet  transform  (DWT)  and  wavelet 
packet  transform  (WPT).  Section  3  introduces  thresholding  techniques  for  financial 
forecasting  and  describes  best  basis  selection  and  best  level  criteria  techniques  (Tree 
Pruning  Algorithm).  Section  4  describes  our  model  framework  and  Section  5  analyzes 
our  experimental  results.  Finally  Section  6  contains  final  comments. 


Table 1. Prior Case Studies Using Wavelet Transform Techniques Applied to Financial Markets

| Author (Year) | Purpose | Data | Methodology | Results |
|---|---|---|---|---|
| Pancham (1994) | Test the multi-fractal market hypothesis | Monthly, weekly, daily index | - | Accepted the multi-fractal market hypothesis |
| Cody (1994) | Present the concept of wavelets and the WT methods | General financial market data | DWT, WPT; multi-scale linear prediction system | Suggested possible applications of the DWT to financial market analysis |
| Tak (1995) | Forecasting univariate time series | Standard & Poor's 500 index | Mexican-hat wavelet; ARIMA, detrending and AR, random walk, ANN | Outperformed the original data |
| Greenblatt (1996) | Analysis for structure in financial data | Foreign exchange rates | Coif-1, Coif-5; best orthogonal basis, matching pursuit, method of frames, basis pursuit | Found structure in financial data |
| McCabe and Weigend (1996) | Determine at which time-scale the series is most predictable | DM/US Dollar | Haar wavelet; predictive linear models for multiresolution analysis | Rarely better than predicting the mean of the process |
| Hog (1996) | Estimate the fractional differencing parameter in Fractional Brownian Motion models for interest rate having the term structure | Monthly US 5-year yields on pure discount bonds (1965.11-1987.02) | Haar wavelet; ARFIMA(0, d+1, 0) where H = d + 1/2 | Estimated d = 0.900; 95% confidence interval for d = [0.8711, 0.9289] |
| Hog (1997) | Analyze non-stationary but possibly mean-reverting processes | US interest rate | Haar wavelet; ARFIMA | Showed mean reversion of US interest rate |
| Aussem et al. (1998) | Predict the trend (up or down) 5 days ahead | S&P 500 closing prices | À trous wavelet; dynamic recurrent NN & 1 nearest neighbors | 86% correct prediction of the trend |


2 Discrete Wavelet Transform and Wavelet Packet Transform

Recently, local atomic decompositions (wavelets, wavelet libraries) have become
popular for the analysis of deterministic signals as an alternative to non-local
Fourier representations. The Fourier transform is usually not suitable for
non-stationary signals.

Each scale of wavelet coefficients provides a different dimension of the time
series in both the time and frequency domains. Recently, due to the similarity
between wavelet decomposition and neural networks (NN), the idea of combining
wavelets and NNs has been proposed in various works (Bakshi and Stephanopoulos
[4], [5]; Delyon et al. [19]; Geva [25]; Zhang [52]; Zhang and Benveniste [51]).

Recently, the wavelet transform was introduced as an alternative technique for
time-frequency decomposition (Daubechies [17], [18]). Wavelets are any of a set of
special functions satisfying certain regularity conditions. Their support is
finite: they are non-zero on a finite interval, and they are defined within finite
frequency bands.

The WT is a powerful method for multiresolution representation of signal data (Szu
et al. [44]). The discrete wavelet transform (DWT) expresses a time series as a
linear combination of scaled and translated wavelets. Knowing which wavelets
appear in a transform can provide information about the frequency content of the
signal for a short time period.

The DWT is generally calculated by the recursive decomposition algorithm known as
the pyramid algorithm or tree algorithm (Mallat [34]), which offers a
hierarchical, multiresolution representation of a function (signal). As shown in
Fig. 1(a), in the tree algorithm, the set of input data is passed through the
scaling and the wavelet filters.


[Figure: (a) Tree or Pyramid algorithm (Mallat, 1989); (b) Wavelet packet transform
(Coifman et al., 1993)]

Fig. 1. Discrete Wavelet Transform and Wavelet Packet Transform (G: the lowpass
(or scaling) filter; H: the highpass (or wavelet) filter;
$d^p = \{d_0^p, d_1^p, \ldots, d_{N/2-1}^p\}$: the detail coefficients (highpass
filtered data) at the pth level of resolution;
$a^p = \{a_0^p, a_1^p, \ldots, a_{N/2-1}^p\}$: the approximation coefficients
(lowpass filtered data) at the pth level of resolution.)

Coifman and Meyer [12] developed wavelet packet functions as a generalization of
wavelets (DWT). In the pyramid algorithm the detail branches are not used for
further calculations, i.e. only the approximations at each level of resolution are
treated to yield the approximation and detail obtained at level m+1. Application
of the transform to both the detail and the approximation coefficients results in
an expansion of the structure of the wavelet transform tree algorithm to the full
binary tree (Coifman and Wickerhauser [15]; Coifman et al. [14]).

The  main  difference  is  that  while  in  the  DWT  the  detail  coefficients  are  kept,  and 
the  approximation  coefficients  are  further  analyzed  at  each  step,  in  the  WPT  both  the 
approximation  signal  and  the  detail  signal  are  analyzed  at  each  step.  This  results  in 
redundant  information,  as  each  level  of  the  transform  retains  n  samples.  The  process 
is  illustrated  in  Fig.  1(b). 
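
The difference between the two transforms can be made concrete with a small sketch. It uses the Haar filter pair purely for brevity (the experiments in this study use a Daubechies wavelet with 4 coefficients), and it assumes that the signal length is divisible by 2 raised to the number of levels. The function names `haar_step`, `dwt` and `wpt` are illustrative and do not refer to any library.

```python
import math


def haar_step(x):
    """One filtering step: the lowpass filter (G) gives the approximation,
    the highpass filter (H) gives the detail; both are downsampled by two."""
    s = math.sqrt(2.0)
    approx = [(x[i] + x[i + 1]) / s for i in range(0, len(x), 2)]
    detail = [(x[i] - x[i + 1]) / s for i in range(0, len(x), 2)]
    return approx, detail


def dwt(x, levels):
    """Pyramid algorithm: only the approximation is decomposed further."""
    details = []
    approx = list(x)
    for _ in range(levels):
        approx, d = haar_step(approx)
        details.append(d)
    return approx, details


def wpt(x, levels):
    """Wavelet packet transform: every node is decomposed, which gives a
    full binary tree. Returns the list of nodes at each level."""
    tree = [[list(x)]]
    for _ in range(levels):
        next_level = []
        for node in tree[-1]:
            a, d = haar_step(node)
            next_level.extend([a, d])
        tree.append(next_level)
    return tree
```

For a signal of 8 samples and 2 levels, `dwt` returns an approximation of length 2 plus details of lengths 4 and 2, while every level of the `wpt` tree holds 8 coefficients in total, which is the redundancy mentioned above.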


3  Wavelet  Thresholding  Techniques  As  Optimal  Signal 
Decomposition  for  Financial  Forecasting 

Thresholding  is  a  rule  in  which  the  coefficients  whose  absolute  values  (energies)  are 
smaller  than  a  fixed  threshold  are  replaced  by  zeroes.  The  purpose  of  thresholding  is 
to  determine  which  are  the  good  coefficients  to  keep,  so  as  to  minimize  the  error  of 
approximation. 
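
The rule above is the hard thresholding rule. A minimal sketch of it is given below, together with the soft variant that is compared later in the experiments. The soft rule shown is the standard shrinkage definition and is stated here as an assumption, since this section does not spell it out; the function names are illustrative.

```python
def hard_threshold(coeffs, lam):
    """Replace coefficients with |d| < lam by zero; keep the others unchanged."""
    return [d if abs(d) >= lam else 0.0 for d in coeffs]


def soft_threshold(coeffs, lam):
    """Zero small coefficients and shrink the remaining ones toward zero by lam."""
    out = []
    for d in coeffs:
        mag = abs(d) - lam
        if mag > 0:
            out.append(mag if d > 0 else -mag)
        else:
            out.append(0.0)
    return out
```

With a threshold of 1.0, the coefficient -2.0 is kept as -2.0 by the hard rule and shrunk to -1.0 by the soft rule, while 0.5 is set to zero by both.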

In  this  study,  we  define  wavelet  thresholding  techniques  as  denoising,  and 
smoothing  techniques  including  best  basis  selection  and  best  level  algorithm  to 
extract  significant  multi-scale  information  from  the  original  time  series. 

Several  approaches  to  thresholding  have  been  introduced  in  the  literature  (See 
Table  2). 


Table 2. Wavelet Thresholding Techniques

| Authors (Year) | Thresholding Rules |
|---|---|
| Donoho and Johnstone (1994) | Universal (VisuShrink): minimax approach; $\lambda = \sqrt{2\log(n)}\,\sigma$, with the thresholding rule $\delta(d, \lambda)$ applied to all the wavelet coefficients $d$ |
| Donoho and Johnstone (1995) | Adaptive (SureShrink): minimax approach; based on an estimator of risk |
| Nason (1994, 1995, 1996), Weyrich and Warhola (1995), Jensen and Bultheel (1997) | Cross-validation: $CV = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$ |
| Abramovich and Benjamini (1995, 1996), Ogden and Parzen (1996a, 1996b) | Multiple hypothesis tests: test if each wavelet coefficient is zero or not |
| Vidakovic (1994), Clyde et al. (1995), Chipman et al. (1997) | Bayes rule |
| Goel and Vidakovic (1995) | Lorentz curve: $p_0 = \frac{1}{n}\sum_{i} \mathbf{1}(d_i^2 \le \overline{d^2})$ |
| Abramovich and Benjamini (1995) | The False Discovery Rate (FDR) approach to multiple hypothesis testing |
| Wang (1996), Johnstone and Silverman (1997) | Correlated errors |

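
As an illustration of the universal rule in Table 2, the threshold value can be computed as sketched below. The formula follows the table; the noise estimate based on the median absolute deviation (MAD) divided by 0.6745 is a commonly used choice that is assumed here and is not stated in this paper. Both function names are hypothetical.

```python
import math
import statistics


def universal_threshold(n, sigma):
    """Universal threshold: lambda = sigma * sqrt(2 * log(n))."""
    return sigma * math.sqrt(2.0 * math.log(n))


def mad_sigma(detail_coeffs):
    """Assumed robust noise estimate: median absolute deviation / 0.6745."""
    med = statistics.median(detail_coeffs)
    mad = statistics.median([abs(d - med) for d in detail_coeffs])
    return mad / 0.6745
```

For n = 1024 observations and unit noise level the threshold is about 3.72, so only coefficients well above the noise level survive.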

Table 3. Wavelet Packet Basis Selection Algorithms

| Authors (Year) | Basis Selection Algorithms | Contents |
|---|---|---|
| Daubechies (1988) | Method of Frames (MOF) | Synthesis direction approach; straightforward linear algebra |
| Coifman and Wickerhauser (1992) | Best Orthogonal Basis | Shannon entropy; bottom-up tree searches |
| Mallat and Zhang (1993) | Matching Pursuit | Synthesis direction approach |
| Chen (1995), Chen and Donoho (1995b), Chen et al. (1998) | Basis Pursuit | Similar to MOF; a large-scale constrained optimization |
| Donoho (1995b) | CART | Shannon entropy |

In wavelet packet functions, as a generalization of wavelets (DWT), a best basis
can explicitly contain the criterion of coefficient selection. For instance, the
best basis can be defined as the basis with the minimal number of coefficients
whose absolute value is higher than the predefined threshold.
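
A best basis search under this counting criterion can be sketched as follows. The sketch again uses Haar filters for brevity and assumes a signal length that is a power of two; the names `cost` and `best_basis` are illustrative. A node of the wavelet packet tree is split only if its two children together have fewer coefficients above the threshold than the node itself, which is the usual comparison made in a bottom-up tree search, written here recursively.

```python
import math


def haar_step(x):
    """Split x into approximation and detail with the Haar filter pair."""
    s = math.sqrt(2.0)
    a = [(x[i] + x[i + 1]) / s for i in range(0, len(x), 2)]
    d = [(x[i] - x[i + 1]) / s for i in range(0, len(x), 2)]
    return a, d


def cost(coeffs, lam):
    """Number of coefficients whose absolute value exceeds the threshold."""
    return sum(1 for c in coeffs if abs(c) > lam)


def best_basis(x, levels, lam):
    """Return (cost, coefficient blocks) of the cheapest basis found in the
    wavelet packet tree rooted at x, searching at most `levels` deep."""
    own = cost(x, lam)
    if levels == 0 or len(x) < 2:
        return own, [list(x)]
    a, d = haar_step(x)
    cost_a, basis_a = best_basis(a, levels - 1, lam)
    cost_d, basis_d = best_basis(d, levels - 1, lam)
    if cost_a + cost_d < own:
        return cost_a + cost_d, basis_a + basis_d
    return own, [list(x)]
```

For a constant signal of 8 samples, the search reduces the 8 coefficients above the threshold to a single one, because all detail coefficients vanish.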

In addition, the best level algorithm (Coifman et al. [13]) computes the optimal
complete sub-tree of an initial tree with respect to an entropy-type criterion.
The resulting complete tree may be of smaller depth than the initial one. The only
difference from best basis selection algorithms is that the optimal tree is
searched for among the complete sub-trees of the initial tree.


4  Research  Model  Architecture 

Our study analyzes wavelet thresholding or filtering methods for extracting
optimal multi-signal decomposed series (i.e. highpass and lowpass filtered series)
as key input variables for fitting a neural network based forecasting model,
especially under chaotic financial markets (see Fig. 2).


[Figure: nonlinear dynamic analysis; multi-scale decomposition; input x(t) with
lagged values x(t-1), x(t-2), x(t-3); neural network architecture (hill climbing);
theory-based or data-driven thresholding criteria (λ) for optimal multi-scale
decomposition]

Fig. 2. Integration Framework of Wavelet Transformation and Neural Networks


4.1  Nonlinear  Dynamic  Analysis 

In chaos theory, it has been proved that the original characteristics of a chaotic
system can be reconstructed from a single time series by using a proper embedding
dimension.

In this study, we use the dimension information specifically to determine the
number of time-lagged input variables of the neural network models. For example,
the embedding dimension of 5 estimated in our study indicates that 4 time-lagged
data points are used as input factors of a neural network to predict the 5th data
point of the time series.
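
The mapping from the embedding dimension to network inputs can be sketched as a sliding window over the series. The function name `make_lagged_samples` is hypothetical; the sketch only shows how 4 lagged values and the following value would be paired when the embedding dimension is 5.

```python
def make_lagged_samples(series, embedding_dim=5):
    """Split a series into (inputs, target) pairs: embedding_dim - 1
    consecutive values are the inputs and the next value is the target."""
    n_lags = embedding_dim - 1
    samples = []
    for t in range(n_lags, len(series)):
        samples.append((series[t - n_lags:t], series[t]))
    return samples
```

A series of six values therefore yields two training samples, each with four inputs and one target.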


4.2  Neural  Networks 

For time series prediction, the most widely used neural networks are time delay
neural networks (TDNN; Weigend et al. [49]) and recurrent neural networks (RNN;
Elman [24]). While in the dynamic context recurrent neural networks can outperform
time delay neural networks, they are occasionally difficult to train optimally by
a standard backpropagation algorithm, due in part to the dependence of their
network parameters (Kuan and Hornik [33]).

In this study, the basic model we experiment with is a backpropagation neural
network (BPN) with a parsimonious structure of 4 input nodes, 4 hidden nodes and
1 output node, using a single wavelet filter, i.e. a highpass or lowpass filter,
within the network structure. The other model we experiment with is a BPN with 8
input nodes, 8 hidden nodes and 1 output node, using all the multiple filters.


5  Experimental  Results 


In this section, we evaluate prior wavelet thresholding methodologies using the
daily Korean Won / U.S. Dollar exchange rates from January 10, 1990 to June 25,
1997, which are transformed to returns using the logarithm and then standardized.
The learning phase involved observations from January 10, 1990 to August 4, 1995,
while the testing phase ran from August 7, 1995 to June 25, 1997.
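
The preprocessing and the error measure can be sketched as follows. Two details are assumptions made for the sketch and are not stated in the paper: returns are taken as the logarithm of the ratio of consecutive rates, and standardization uses the population standard deviation of the whole sample passed in. The function names are illustrative.

```python
import math
import statistics


def log_returns(rates):
    """Logarithmic returns of consecutive exchange rates."""
    return [math.log(rates[t] / rates[t - 1]) for t in range(1, len(rates))]


def standardize(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


def rmse(actual, predicted):
    """Root mean squared error between two equally long sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))
```

The RMSE defined this way is the quantity on which the forecasting models are compared on the test samples.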

We transform the daily returns into decomposed series, i.e. an approximation part
and a detail part, by the Daubechies wavelet transform with 4 coefficients for the
neural network forecasting models in our study.

In summary, we use several of the thresholding strategies shown in Tables 2 and 3
and then compare their forecasting performance on test samples. The results are
shown in Tables 4-6. In our experiments, lowpass and highpass filters are both
considered in the wavelet transform, and their complementary use provides signal
analysis and synthesis.

First,  we  select  the  most  efficient  basis  out  of  the  given  set  of  bases  to  represent  a 
given  signal  (See  Fig.  3.). 


Fig. 3. WPT Analysis Using Daily Korean Won / US Dollar Returns Data
[Parentheses contain information about the wavelet level index (left-hand side)
and the wavelet coefficient index at the same level (right-hand side)]

Tables 4, 5, and 6 compare the thresholding performance of different preprocessing
methods in forecasting models.

First, our experimental results (Tables 4-6) show that WTs are very good methods
for noise filtering and data compression. This is doubtless due to the fact that
varying resolution scales are treated, thus taking into account a range of
superimposed phenomena.

Tables 4 and 5 contain the comparison between hard and soft thresholding. Soft
thresholding is hardly different from hard thresholding in the experimental
results. Tables 4-6 also show the differences in performance among compression,
denoising, the best basis method, the best level method, and cross-validation.

Except for the cross-validation method with the DWT, however, no method
significantly outperformed the others in terms of neural network based forecasting
performance. That is, only the cross-validation method performs significantly
better than the other techniques, and the other methods have almost the same
results.

The data-driven approach nevertheless has the following limitation: varying
results can be obtained with different experimental conditions (signal classes,
noise levels, sample sizes, wavelet transform parameters) and error measures, i.e.
the cost function for global model optimization.

Ideally, the interplay between the theory-based and the experimental (or
data-driven) approach to implementing optimal wavelet thresholding should provide
the best model performance under the above experimental conditions.


Table 4. Discrete Wavelet Transform Thresholding Performance Using Test Samples

| Threshold Techniques | Threshold Strategy | Filter Types | Network Structure | RMSE |
|---|---|---|---|---|
| - | - | | Random Walks | 2.939007 |
| - | - | | BPN(4-4-1) (c) | 1.754525 |
| Cross-validation | - | HP&LP (a) | | 1.676247 |
| Data Compression | Hard Thresholding | LP (b) | BPN(4-4-1) | 1.766189 |
| | | HP&LP | | 1.760744 |
| Data Denoising | Soft Thresholding | LP | BPN(4-4-1) | 1.767864 |
| | | HP&LP | BPN(8-8-1) | 1.751537 |
| | Hard Thresholding | LP | BPN(4-4-1) | 1.766579 |
| | | HP&LP | BPN(8-8-1) | 1.754131 |

a: Highpass and Lowpass filters, b: Lowpass filter,
c: BPN(I-H-O) = Backpropagation NN (I: Input Nodes; H: Hidden Nodes; O: Output Nodes).

Table 5. Wavelet Packet Transform Thresholding Performance Using Test Samples

| Thresholding Techniques | Thresholding Strategy | Filter Types | Network Structure | RMSE |
|---|---|---|---|---|
| - | - | | BPN(4-4-1) | 1.754525 |
| Data Compression | Hard Thresholding | LP | BPN(4-4-1) | 1.774456 |
| | | LP&HP | BPN(8-8-1) | 1.759434 |
| Data Denoising | Soft Thresholding | LP | BPN(4-4-1) | 1.774456 |
| | | LP&HP | BPN(8-8-1) | 1.759434 |


Table 6. Best Basis Selection and Best Level Technique Performance Using Test Samples

Criteria               Contents                         BPN Structure        RMSE
---------------------  -------------------------------  -------------------  --------
Best Orthogonal Basis  Coifman and Wickerhauser (1992)  LP (4-4-1)           1.764243
                                                        (8-8-1)              1.74329
Best Level             Coifman et al. (1994)            LP (4-4-1)           1.767424
                                                        LP&HP [illegible]    1.748388

6  Concluding  Remarks 

Our research was motivated by a few problems central to time series analysis, namely
how to extract non-stationary signals that may contain abrupt changes, such as level
shifts, in the presence of impulsive outlier noise in short-term financial time series.
Our research indicates that a wavelet approach is an attractive alternative, offering
a very fast algorithm with good theoretical properties and predictability in
financial forecasting model design.

Beyond our experimental results, wavelet shrinkage (denoising) has also been
theoretically proven to be nearly optimal from the following perspectives: spatial
adaptation, estimation when local smoothness is unknown, and estimation when
global smoothness is unknown (Taswell [46]). In the future, these techniques should
become increasingly promising as they are adapted to the features of each domain.

Abstract. This article deals with the possible computer applications of
the Sound Approach to English phonetic alphabet. The authors review
their preliminary research into some of the more promising approaches
to the application of this phonetic alphabet to the processes of machine
learning, computer spell-checking, etc. Applying the mathematical
approach of rough sets to the development of a data-based spelling
recognizer, the authors delineate the parameters of the international
cooperative research project in which they have been engaged since 1997,
and point out the direction of both the continuation of the current
project and of future studies.


1  Introduction 

In 1993-1994, the first author developed and did initial testing on a new system
of phonetic spelling of the sounds in English as an aid to learning better English
pronunciation and improving listening and spelling skills in English for Japanese
students of English. The method, subsequently entitled Sound Approach, was
tested initially on Japanese high school and university students. The results of
the testing indicated that the creation of a sound map of English was very
helpful in overcoming several common pronunciation difficulties faced by Japanese
learners of English as well as improving their English listening, sight reading,
and spelling skills [1]. It was further tested on Japanese kindergarten children
(ages 3-6), primary school pupils (ages 6-11), and Russian primary school pupils
(ages 9-10) and secondary school students (ages 11-13) with similar results [2-3].
It was then tested on a wide range of international ESL (English as a Second
Language) students at the University of Regina. These latest results, while still
preliminary, indicate that it is an effective and useful tool for helping any
non-native speaker of English to overcome pronunciation and orthographic barriers
to the effective use of English. The current stage of development for ESL/EFL
(English as a Second Language/English as a Foreign Language) includes lesson
plans for teachers, flip-cards and a workbook for students, and laminated
wall charts. The next stage of development includes interactive CD-ROMs and
various computer applications.

One of the objectives of the Sound Approach to teaching English language
is the development of a spelling recognition system for words expressed in a
phonetic alphabet of forty-two symbols known as the Sound Approach Phonetic
Alphabet (SA). The SA alphabet represents without ambiguity all sounds
appearing in the pronunciation of English language words, and does so without using
any special or unusual symbols or diacritical marks; SA only uses normal English
letters that can be found on any keyboard but arranges them so that consistent
combinations of letters always represent the same sound. Consequently, any
spoken word can be uniquely expressed as a sequence of SA alphabet symbols, and
pronounced properly when read by a reader who knows the SA alphabet. Due
to representational ambiguity and the insufficiency of English language
characters to adequately and efficiently portray their sounds phonetically (for
example, there are between 15 and 20 English vowel sounds depending on regional
dialect, but only five letters to represent them in traditional English
orthography), the relationship between a word expressed in the SA alphabet and its
possible spellings is one to many. That is, each SA sequence of characters can be
associated with a number of possible, homophonic sequences of English language
characters. However, within a sentence usually only one spelling for a spoken word
is possible. The major challenge in this context is the recognition of the proper
spelling of a homophone/homonym given in SA language. Automated recognition of the
spelling has the potential for development of SA-based phonetic text editors
which would not require the user to know the spelling rules for the language
but only to be able to pronounce a word within a relatively generous margin
of error and to express it in the simple phonetic SA-based form. Computerized
text editors with this ability would tremendously simplify the English language
training process, for example, by focusing the learner on the sound contents of
the language and its representation in an unambiguous form using SA symbols,
and in a wider sense, allow for more equal power in the use of English by any
native or non-native speaker of English.
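
The one-to-many relationship between an SA sequence and its possible spellings can be pictured as a lookup table. The following is a minimal sketch only: apart from "teibul" for table, which appears later in this paper, the SA transcriptions ("eit", "eid") are hypothetical placeholders and not actual SA spellings, and the function names are assumptions.

```python
from collections import defaultdict

# (SA sequence, standard spelling) pairs. Only "teibul" -> "table" comes from
# the paper; "eit" and "eid" are hypothetical transcriptions for illustration.
PAIRS = [
    ("teibul", "table"),
    ("eit", "ate"),
    ("eit", "eight"),
    ("eid", "aid"),
    ("eid", "ade"),
]

def build_lexicon(pairs):
    """Group standard spellings by their SA sequence (one SA key, many spellings)."""
    lexicon = defaultdict(list)
    for sa_word, spelling in pairs:
        lexicon[sa_word].append(spelling)
    return dict(lexicon)

def candidates(lexicon, sa_word):
    """Return every spelling that shares the given SA sequence."""
    return lexicon.get(sa_word, [])

LEXICON = build_lexicon(PAIRS)
ambiguous = {sa for sa, spellings in LEXICON.items() if len(spellings) > 1}
```

Here `candidates(LEXICON, "teibul")` yields a single spelling, while the keys collected in `ambiguous` are the cases where context is needed to choose among several spellings.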

2  Approach 

The approach adopted in this project would involve the application of the
mathematical theory of rough sets in the development of a data-based word spelling
recognizer. The theory of rough sets is a collection of mathematical tools mainly
used in the processes of decision table derivation, analysis, decision table
reduction and decision rules derivation from data (see, for instance, references [4-9]).
In the word spelling recognition problem, one of the difficulties is the fact that
many spoken words given in SA form correspond to a number of English language
words given in a standard alphabet. To resolve, or to reduce, this ambiguity, the
context information must be taken into account. That is, the recognition
procedure should involve words possibly appearing before, and almost certainly after,
the word to be translated into standard English orthography. In the rough-set
approach this will require the construction of a decision table for each spoken
word. In the decision table, the possible information inputs would include context
words surrounding the given word and other information such as the position
of the word in the sentence, and so on. Identifying and minimizing the required
number of information inputs in such decision tables would be one of the more
labor-intensive parts of the project. In this part, the techniques of rough sets,
supported by rough-set based analytical software such as KDD-R [10-11], would
be used in the analysis of the classificatory adequacy of the decision tables, and
their minimization and extraction of classification (decision) rules to be used in
the spelling recognition. It should be emphasized at this point that the process
of minimization and rule extraction would be automated to a large degree and
adaptive in the sense that inclusion of new spoken word-context combinations
would result in regeneration of the classification rules without human
intervention. In this sense the system would have some automated learning ability
allowing for continuous expansion as more and more experience is accumulated
while being used.
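
The adaptive behaviour described above, where adding a new word-context combination regenerates the rules without human intervention, might be sketched as follows. This is a simplified stand-in and not the KDD-R procedure: the class name, the choice of context (previous and next word), and the most-frequent-spelling rule are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

class SpellingTable:
    """Toy decision table for one spoken (SA) word: context -> spelling counts."""

    def __init__(self):
        self.rows = defaultdict(Counter)  # context tuple -> Counter of spellings
        self.rules = {}                   # context tuple -> chosen spelling

    def add_example(self, context, spelling):
        """Record a new word-context combination and regenerate the rules."""
        self.rows[tuple(context)][spelling] += 1
        self._regenerate()

    def _regenerate(self):
        # One rule per observed context: pick the most frequent spelling.
        self.rules = {
            ctx: counts.most_common(1)[0][0] for ctx, counts in self.rows.items()
        }

    def predict(self, context):
        """Return the spelling for a known context, or None if it is unseen."""
        return self.rules.get(tuple(context))

table = SpellingTable()
table.add_example(("we", "chicken"), "ate")  # from "we ate chicken for dinner"
table.add_example(("need", None), "aid")     # from "we need aid"
```

An unseen context returns `None`, which is the situation in which a real system would ask the user and add the answer as a new example.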

3  Rough  Sets 

The theory of rough sets and their application methodology has been under
continuous development for over 15 years now. The theory was originated by
Zdzislaw Pawlak [4] in the 1970's as a result of long term fundamental research
on logical properties of information systems, carried out by himself and a group
of logicians from the Polish Academy of Sciences and the University of Warsaw,
Poland. The methodology is concerned with the classificatory analysis of
imprecise, uncertain or incomplete information or knowledge expressed in terms of
data acquired from experience. The primary notions of the theory of rough sets
are the approximation space and lower and upper approximations of a set. The
approximation space is a classification of the domain of interest into disjoint
categories. The classification formally represents our knowledge about the
domain, i.e., knowledge is understood here as an ability to characterize all classes
of the classification, for example, in terms of features of objects belonging to the
domain. Objects belonging to the same category are not distinguishable, which
means that their membership status with respect to an arbitrary subset of the
domain may not always be clearly definable. This fact leads to the definition
of a set in terms of lower and upper approximations. The lower approximation
characterizes domain objects about which it is known with certainty, or with a
controlled degree of uncertainty [7-8], that they do belong to the subset of
interest, whereas the upper approximation is a description of objects which possibly
belong to the subset. Any subset defined through its lower and upper
approximations is called a rough set. The main specific problems addressed by the theory
of rough sets are:

- representation of uncertain, vague or imprecise information;

- empirical learning and knowledge acquisition from experience;

- decision table analysis;

- evaluation of the quality of the available information with respect to its
consistency and presence or absence of repetitive data patterns;

- identification and evaluation of data dependencies;

- approximate pattern classification;

- reasoning with uncertainty;

- information-preserving data reduction.
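
The lower and upper approximations defined above can be computed directly from the classes of the approximation space. The following is a minimal sketch; the function names and the toy domain are illustrative assumptions.

```python
def partition(objects, key):
    """Split the domain into disjoint classes of objects with the same key value."""
    classes = {}
    for obj in objects:
        classes.setdefault(key(obj), set()).add(obj)
    return list(classes.values())

def approximations(classes, target):
    """Return (lower, upper) approximations of `target` for the given classes."""
    lower, upper = set(), set()
    for cls in classes:
        if cls <= target:   # every object of the class certainly belongs
            lower |= cls
        if cls & target:    # at least one object of the class possibly belongs
            upper |= cls
    return lower, upper

# Toy domain: objects 1..6 are indistinguishable when they share a remainder mod 3.
classes = partition(range(1, 7), key=lambda x: x % 3)
lower, upper = approximations(classes, target={1, 4, 2})
# lower == {1, 4}; upper == {1, 2, 4, 5}
```

The target set is rough here because object 2 cannot be told apart from object 5, so the class {2, 5} falls in the upper but not the lower approximation.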

A  number  of  practical  applications  of  this  approach  have  been  developed  in 
recent  years  in  areas  such  as  medicine,  drug  research,  process  control  and  others. 
One  of  the  primary  applications  of  rough  sets  in  artificial  intelligence  (AI)  is  for 
the purpose of knowledge analysis and discovery in data [6]. Several extensions
of  the  original  rough  sets  theory  have  been  proposed  in  recent  years  to  better 
handle  probabilistic  information  occurring  in  empirical  data,  and  in  particular 
the  variable  precision  rough  sets  (VPRS)  model  [7-8]  which  serves  as  a  basis  of 
the  software  system  KDD-R  to  be  used  in  this  project.  The  VPRS  model  extends 
the  original  approach  by  using  frequency  information  occurring  in  the  data  to 
derive  classification  rules. 
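
To illustrate how frequency information can relax the lower approximation, the following sketch admits a class when the fraction of its objects outside the target set does not exceed a tolerance. The parameter name `beta`, its default value, and the toy data are assumptions for illustration and do not describe the KDD-R implementation.

```python
def vprs_lower(classes, target, beta=0.25):
    """Union of the classes whose misclassification rate w.r.t. target is <= beta.

    Classes are assumed to be non-empty; beta = 0 gives the classical lower
    approximation.
    """
    result = set()
    for cls in classes:
        error = 1 - len(cls & target) / len(cls)
        if error <= beta:
            result |= cls
    return result

classes = [{1, 2, 3, 4, 5}, {6, 7}, {8, 9, 10}]
target = {1, 2, 3, 4, 6, 8, 9, 10}

strict = vprs_lower(classes, target, beta=0.0)     # only {8, 9, 10} qualifies
relaxed = vprs_lower(classes, target, beta=0.25)   # {1, ..., 5} is admitted as well
```

The first class has four of its five objects in the target, so it is rejected by the classical definition but accepted once a 25 percent error is tolerated.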

In practical applications of rough sets methodology, the object of the analysis
is a flat table whose rows represent some objects or observations expressed in
terms of values of some features (columns) referred to as attributes. Usually, one
column is selected as a decision or recognition target, called a decision attribute.
The objective is to provide enough information in the table, in terms of attributes
of a sufficient number and quality, and a sufficient number of observations, so
that each value of the decision attribute could be precisely characterized in terms
of some combinations of various features of observations. The methodology of
rough sets provides a number of analytical techniques, such as dependency
analysis, to assess the quality of the information accumulated in such a table (referred
to as a decision table). The decision table should be complete enough to enable
the computer to correctly classify new observations or objects into one of the
categories existing in the table (that is, matching the new observation vector by
having identical values of conditional attributes). Also, it should be complete
in terms of having enough attributes to make sure that no ambiguity would
arise with respect to the predicted value of the target attribute (which is the
spelling category in the case of this application). One of the advantages of the
rough sets approach is its ability to optimize the representation of the
classification information contained in the table by computing a so-called reduct, that is,
a minimal subset of conditional attributes preserving the prediction accuracy.
Another useful aspect is the possibility of the extraction of the minimal length,
or generalized, decision rules from the decision table. Rules of this kind can
subsequently be used for decision making, in particular for predicting the spelling
category of an unknown sound.
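
The notions of reduct and decision rule can be made concrete with a small brute-force sketch. It assumes the full table is consistent (no two rows agree on all conditional attributes but differ in the decision) and enumerates attribute subsets exhaustively, which is only practical for a handful of attributes. The attribute names and rows are hypothetical and are not taken from the project's tables.

```python
from itertools import combinations

def is_consistent(rows, attrs, decision):
    """True if rows that agree on `attrs` always agree on the decision."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in attrs)
        if seen.setdefault(key, row[decision]) != row[decision]:
            return False
    return True

def reducts(rows, attrs, decision):
    """All minimal attribute subsets that keep the table consistent."""
    found = []
    for size in range(len(attrs) + 1):
        for subset in combinations(attrs, size):
            if any(set(r) <= set(subset) for r in found):
                continue  # a smaller reduct is contained in this subset
            if is_consistent(rows, subset, decision):
                found.append(subset)
    return found

def rules(rows, attrs, decision):
    """Distinct (conditions, decision) pairs over the chosen attributes."""
    return sorted(
        {(tuple((a, row[a]) for a in attrs), row[decision]) for row in rows}
    )

# Hypothetical rows: grammatical category of the two words before the target.
rows = [
    {"prev": "verb", "prev2": "pron", "spelling": "aid"},
    {"prev": "pron", "prev2": "none", "spelling": "ate"},
    {"prev": "verb", "prev2": "none", "spelling": "aid"},
]
found = reducts(rows, ["prev", "prev2"], "spelling")
short_rules = rules(rows, found[0], "spelling")
```

In this toy table the attribute `prev` alone determines the spelling, so it is the only reduct and the rules built from it have a single condition each.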


[Table 1 body is illegible in the source: nine numbered rows of attribute and decision values.]

Table 1. Classification of training sentences by using grammatical categories


In the current preliminary testing of SA, a selection of homonyms was put
into representative “training” sentences. For each group of “confusing” words one
recognition table was constructed. For example, one decision table was developed
to distinguish the spelling of the similar-sounding words ade, aid, ate and eight.
Some of the training sentences used in deriving the table were as follows:

“we need aid”, “she is a nurse's aid”, “we ate chicken for dinner”, and so on.
The relative word positions (relative to the target word) in the sentences
played the role of attributes of each sentence. That is, attribute -1 represented
the predecessor of the target word, attribute -2 the next preceding
word, and so on. Only up to five positions preceding the target word were
used in the representation. The values of the attributes so defined were the
grammatical categories of the words appearing in particular positions, e.g., verb
(value=1), noun (value=2), etc. These values were then used to synthesize decision
tables by categorizing training sentences into a number of classes. The decision
tables were subsequently subjected to dependency analysis and reduction to
eliminate redundant inputs. For instance, an example of a final reduced decision
table obtained for the words ade, aid, ate and eight is shown in Table 1.
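
The encoding of a training sentence into positional attributes might look like the sketch below. Only the codes verb=1 and noun=2 come from the text; the remaining category codes, the code 0 for an empty position, and the tiny part-of-speech lookup are assumptions made so the example is self-contained.

```python
# verb=1 and noun=2 follow the text; the other codes are assumed.
CATEGORY = {"verb": 1, "noun": 2, "pronoun": 3, "article": 4, "other": 5}
EMPTY = 0  # assumed code for "no word at this position"

# Hypothetical part-of-speech lookup covering the sample sentences only.
POS = {
    "we": "pronoun", "she": "pronoun",
    "need": "verb", "is": "verb", "ate": "verb",
    "a": "article",
    "nurse's": "noun", "chicken": "noun", "dinner": "noun",
}

def encode(sentence, target_index, width=5):
    """Map positions -1..-width (relative to the target word) to category codes."""
    words = sentence.lower().split()
    row = {}
    for k in range(1, width + 1):
        i = target_index - k
        if i >= 0:
            row[-k] = CATEGORY[POS.get(words[i], "other")]
        else:
            row[-k] = EMPTY
    return row

row_aid = encode("we need aid", target_index=2)
row_nurse = encode("she is a nurse's aid", target_index=4)
```

Each encoded row, paired with the correct spelling of the target word, would form one line of the decision table for that group of confusable words.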

In the preliminary experiments, it was found that using the decision
tables the computer could accurately choose the correct spelling of non-dependent
homonyms (i.e., those homonyms for which the simple grammatical protocol was
unable to determine the correct spelling from the context) 83.3 percent of the
time, as in the sentence “The ayes/eyes have it.” With dependent homonyms, as
in the sentence “aie eight meals”, the computer could accurately choose the correct
spelling more than 98 percent of the time.

4  Major  Stages  of  the  Initial  Project 

The  initial  project  was  divided  into  the  following  major  stages  which,  depending 
on  funding,  could  have  significantly  shortened  time-frames: 

1. Construction of decision tables for the selected number of English language
homonyms or homophones. This part would involve research into possible
contexts surrounding the selected words in typical sentences and their
representation in decision table format. This would also involve rough set analysis,
optimization and testing (with respect to completeness and prediction
accuracy) of the constructed tables using the existing software systems Dataquest
[12,13] or KDD-R. The related activity would be the extraction of
classification rules from such tables. This is a very labor-intensive part of the
project since the number of possible homonyms or homophones is
approximately 3000. The time-frame for this part of the project is
approximately two years.

2. Editor development using the tables constructed in Stage 1 as a main
component of the spelling recognition system. The editor would have some learning
capabilities in the sense of being able to automatically acquire new feedback
word combinations in cases of unsuccessful recognitions. The editor will be
constructed in a similar pattern to Japanese Romaji-Hiragana-Kanji word
processing selection tables. The estimated time for this stage of the project
is approximately one year to construct a working prototype system, assuming
two full-time programmers would be involved in the system development.

3. This stage would involve both system testing and refinement, going through
multiple feedback loops until satisfactory system performance and user
satisfaction are achieved. The system would be tested with English language
students at Yamaguchi University and other international locations. The
accumulated feedback would be used to retrain and enhance the system’s
spelling recognition capabilities and to refine the user interface to make
it as friendly as possible. It is also felt that the system, by using SA, can be
adapted to any regional pronunciation style (e.g., Australian, British Received,
Indian, Irish, etc.) by offering users their choice of keyboards for their
particular area. For example, in standard International Broadcast English
the word table would be represented in SA by spelling it teibul, whereas in
Australian English it could be represented in SA by spelling it taibul, and the
computer would still offer the standard orthographic representation of table
in the spell-checking process in either keyboard format. At this stage, not
only could it be used as an ordinary spell checker, but it could be programmed
for speech as well so that the user could have the word or passage read
and spoken by the computer in either sound spelling or in regular spelling.
As a normal spell checker, for example, it would have difficulty distinguishing
between the words bother and brother. However, with speech capacity, the
user could potentially hear the difference and catch the mistake. This could
also become an excellent teaching/learning device for practicing and learning
correct pronunciation whether for native or for non-native English speakers.

5  Conclusions 

In the initial study on the efficacy of the Sound Approach phonetic alphabet in
meeting the requirements for the development of easily accessible and accurate
computer word recognition capability, conducted at the University of Regina in
1997, the rough set model was used to construct decision tables on a list of
various English homonyms. It was found that the Sound Approach phonetic
alphabet and the rough set model were quite compatible with each other in
determining decision tables used in decision making for predicting the correct
spelling of a word written either phonetically or in standard English orthography.
It was found in preliminary experiments that even using a relatively unrefined
grammatical protocol and decision tables, we were able to identify the
correct spelling of non-dependent homonyms 83.3 percent of the time. This
accuracy rate rivals already extant forms of standard spelling recognition systems.
When confronted with dependent homonyms, the computer could accurately
choose the correct spelling more than 98 percent of the time.

It is felt that with further refining of the grammatical protocol and
expansion of the sample sentences using the approximately 3000 English homonyms,
a spelling recognition system could be constructed that would allow even
non-native speakers of English to gain equal access and power in the language.
Further, this would be but one of the necessary building blocks for the construction
of a total voice recognition operating system, and a major step forward in
computer speech technology. It is also considered that these advancements have
considerable commercial possibilities that should be developed.

Abstract. This paper describes a recognition system for on-line cursive
handwriting that requires very little initial training and that rapidly learns, and
adapts to, the handwriting style of a user. Key features are a shape analysis
algorithm that determines shapes in handwritten words, a linear segmentation
algorithm that matches characters identified in handwritten words to characters
of candidate words, and a learning algorithm that adapts to the user's writing
style. Using a lexicon with 10K words, the system achieved an average
recognition rate of 81.3% for top choice and 91.7% for the top three choices.


1  Introduction 

As  more  people  use  and  depend  on  computers,  it  is  important  that  computers  become 
easier  to  use.  Many  systems  for  handwriting  recognition  have  been  developed  in  the 
past  35  years  [1][4][5][6][7][8].  In  contrast  to  those  systems,  the  method  proposed  in 
this  paper 

* Dispenses with extensive training of the type required for Hidden Markov Models
and Time Delay Neural Networks [6][7]. Initialization of the knowledge base
consists of providing four samples for each character.

* Uses a shape analysis algorithm that not only supports the identification of
characters but also allows efficient reduction of the lexicon to a small list of
candidate words [1][4].

* Uses a linear-time segmentation technique that optimally matches identified
characters of the handwritten word to characters of a candidate word, in the sense
that the method completely avoids premature segmentation selections that may be
made by some techniques [6][8].

* Learns not only from failure but also from correctly identified words, in contrast to
other prior methods [5][7].

* Does not require the dictionary words to be provided in script form. Thus, switching
to a new vocabulary becomes very simple, requiring merely a new lexicon [6][7].


^  This  work  is  done  as  part  of  my  Ph.D.  study  under  Professor  Klaus  Truemper  in  the  AI  Lab 
of  The  University  of  Texas  at  Dallas,  and  funded  by  the  Office  of  Naval  Research. 

2 Modules of the System

The  system  consists  of  three  modules.  The  preprocessing  module  accepts  as  input  the 
raw  pixel  sequence  of  a  handwritten  word  recorded  by  a  digitizing  tablet  and  converts 
it  to  a  sequence  of  feature  vectors  called  the  basic  code.  The  interpretation  module 
receives  the  basic  code  of  a  handwritten  word  as  input,  deduces  word  shapes,  selects 
from  a  lexicon  a  list  of  candidate  words,  and  from  these  candidates  deduces  by  a 
matching  process  the  interpretation.  The  learning  module  analyzes  the  correct  word, 
which  is  either  the  output  word  of  the  interpretation  module  or  the  intended  word 
supplied  by  the  user,  and  locates  opportunities  for  learning  from  misidentified  letters 
and  from  identified  letters  with  low  match  quality  values.  The  insight  so  obtained 
results in addition, adjustment, or replacement of templates. The next three sections
describe the preprocessing, interpretation, and learning modules, respectively.


Fig. 1. An example of the handwritten word ‘help’ with extracted features and regions, and
possible shapes of strokes.


3  Preprocessing  Module 

An  on-line  handwriting  recognition  system  accepts  handwriting  from  a  digitizer.  Due 
to  technical  limitations  of  the  tablet,  the  raw  pixel  sequence  of  a  handwritten  word 
includes  imperfections  and  redundant  information.  We  first  delete  duplicate  pixels 
caused  by  a  hesitation  in  writing  and  interpolate  non-adjacent  consecutive  pixels 
caused  by  fast  writing,  to  produce  a  continuous  pixel  sequence.  We  then  identify 
pixels  with  particular  characteristics  such  as  local  maxima  and  local  minima.  We  also 
normalize the handwritten word and extract other features such as locations of
extrema, shapes of strokes, slopes of strokes, curvatures of strokes, connections of
strokes, and openings associated with maxima and minima. We organize these
features into a sequence of feature vectors called the basic code, which is the input of
the interpretation module. The left part of Figure 1 shows an example of a handwritten
word  with  extracted  extrema  and  regions.  The  right  part  gives  some  sample  shapes  of 
strokes. 
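
The first preprocessing steps (dropping duplicate pixels, interpolating gaps, and locating vertical extrema) can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the authors' implementation: points are (x, y) integer pairs, y is assumed to increase upward, gaps are filled by simple linear stepping, and plateaus are handled naively.

```python
def remove_duplicates(points):
    """Drop consecutive repeated pixels, e.g. those caused by hesitation."""
    out = []
    for p in points:
        if not out or out[-1] != p:
            out.append(p)
    return out

def interpolate(points):
    """Insert pixels so that consecutive points are adjacent (8-connected)."""
    if not points:
        return []
    out = [points[0]]
    for x1, y1 in points[1:]:
        x0, y0 = out[-1]
        steps = max(abs(x1 - x0), abs(y1 - y0))
        for s in range(1, steps + 1):
            out.append((x0 + round(s * (x1 - x0) / steps),
                        y0 + round(s * (y1 - y0) / steps)))
    return out

def local_extrema(points):
    """Return indices of local maxima and minima of the y coordinate."""
    maxima, minima = [], []
    for i in range(1, len(points) - 1):
        y_prev, y, y_next = points[i - 1][1], points[i][1], points[i + 1][1]
        if y > y_prev and y >= y_next:
            maxima.append(i)
        elif y < y_prev and y <= y_next:
            minima.append(i)
    return maxima, minima

raw = [(0, 0), (0, 0), (1, 1), (3, 3), (4, 2), (5, 1), (6, 2)]
stroke = interpolate(remove_duplicates(raw))
maxima, minima = local_extrema(stroke)
```

In this toy stroke the repeated first pixel is removed, the jump from (1, 1) to (3, 3) is filled with (2, 2), and one maximum and one minimum are found.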

4 Interpretation Module

The  interpretation  module  takes  the  basic  code  as  input  and  interprets  it  as  some  word 
of  a  given  lexicon.  The  module  carries  out  that  task  as  follows.  It  initially  extracts  the 
shape  of  the  handwritten  word,  such  as  ascenders,  descenders  and  their  positions  with 
respect  to  the  baseline  of  the  word.  By  using  the  shape  information,  it  reduces  a  large 
reference  lexicon  to  a  list  of  candidates  which  have  the  same  shape  as  the  handwritten 
word. For each candidate, the module carries out the following steps. First, the
module  identifies  letters  of  the  candidate  in  the  basic  code  using  template  matching 
and  computes  a  match  quality  for  each  identified  letter.  We  emphasize  that  the 
portions  of  the  basic  code  corresponding  to  identified  letters  can,  and  often  do, 
overlap.  Second,  for  each  contiguous  segment  of  basic  code  connecting  identified 
letters,  a  certain  length  is  computed.  Similarly,  for  the  unidentified  letters  of  the 
candidate,  a  certain  length  is  determined  as  well.  Third,  a  linear-time  segmentation 
algorithm  finds  an  optimal  matching  of  identified  characters  of  the  handwritten  word 
to  the  characters  of  the  given  candidate  word,  in  the  sense  that  the  matching 
maximizes  the  sum  of  the  match  quality  values  for  the  identified  letters  minus  the  sum 
of  the  length  differences  for  the  unidentified  letters.  Once  all  candidate  words  have 
been  processed,  the  optimal  matching  of  each  candidate  word  is  scored  and  the 
candidate  with  the  highest  score  is  selected  as  the  desired  word. 
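
Two parts of the interpretation step lend themselves to a short sketch: reducing the lexicon by word shape and choosing the candidate with the highest score. The letter classes, the shape key (sequence of ascenders and descenders), and all numeric values below are assumptions for illustration; the linear-time segmentation algorithm itself is not reproduced here.

```python
# Assumed letter classes for a shape key; the paper does not list them.
ASCENDERS = set("bdfhklt")
DESCENDERS = set("gjpqy")

def shape_key(word):
    """Sequence of ascenders (A) and descenders (D) in the word, in order."""
    key = []
    for ch in word:
        if ch in ASCENDERS:
            key.append("A")
        elif ch in DESCENDERS:
            key.append("D")
    return "".join(key)

def filter_by_shape(lexicon, observed_shape):
    """Keep only the lexicon words whose shape key equals the observed shape."""
    return [w for w in lexicon if shape_key(w) == observed_shape]

def score(match_qualities, length_differences):
    """Sum of match qualities of identified letters minus the sum of length
    differences of unidentified letters, as described in the text."""
    return sum(match_qualities) - sum(length_differences)

lexicon = ["help", "hello", "held", "holy", "gulp"]
shortlist = filter_by_shape(lexicon, "AAD")  # shape observed for written 'help'

# Hypothetical matching results for each shortlisted candidate.
scores = {
    "help": score([0.9, 0.8, 0.7], [0.2]),
    "holy": score([0.9, 0.6], [0.5, 0.4]),
}
best = max(scores, key=scores.get)
```

The shape filter removes most of the lexicon before any template matching is done, which is why the text describes it as an efficient reduction step.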


5  Learning  Module 

The  learning  algorithm  adapts  the  system  to  a  specific  writing  style  by  learning  user 
behavior  and  updating  the  template  set.  User-adaptive  systems  reported  in  the 
literature  conduct  their  adaptive  processes  only  when  a  word  is  not  recognized 
correctly  [5] [7].  We  employ  a  more  elaborate  adaptive  learning  strategy.  The  system 
learns  the  user’s  writing  not  only  when  the  output  word  of  the  system  is  wrong,  but 
also  when  it  is  correct.  In  the  latter  case,  the  system  learns  individual  characters  or 
sub-strings  of  the  word  that  have  not  been  recognized  correctly. 

Given the correct word for a handwritten word, which is either the output
word of the interpretation module confirmed by the user or the intended word
supplied by the user, the learning module analyzes the identified and
unidentified segments of the basic code to find errors to learn from. Learning is
applied to the unidentified segments and to the identified segments with low match
quality. For each learning case, the learning module tries the following three
methods in order:

1. Adding the segment of basic code as a new template if the number of templates
has not reached the maximum allowed in the system.

2. Adjusting the parameters of a template so that the match quality is increased. Such
a change may cause the template to match occurrences of other letters/strings less
often. Hence, we evaluate the positive and negative impact of such adjustments to
decide whether to adjust a template or use the next method.

3. Replacing the least frequently used template by the basic code segment.
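
The three learning methods above can be read as an ordered policy. The sketch below is one possible interpretation: the template representation, the `net_gain` value (positive impact minus negative impact of an adjustment), and the `adjust` callback are hypothetical, since the paper does not specify how template parameters are adjusted or evaluated.

```python
def learn(templates, letter, segment, max_templates, net_gain=0.0, adjust=None):
    """Apply the first applicable method: add, adjust, or replace a template.

    templates: list of dicts with keys "letter", "code", "uses" (assumed layout).
    net_gain:  assumed precomputed benefit of adjusting an existing template.
    adjust:    hypothetical callback performing the parameter adjustment.
    """
    # Method 1: add a new template while there is room.
    if len(templates) < max_templates:
        templates.append({"letter": letter, "code": segment, "uses": 0})
        return "added"
    # Method 2: adjust an existing template if the net impact is positive.
    if adjust is not None and net_gain > 0:
        adjust(templates, letter, segment)
        return "adjusted"
    # Method 3: replace the least frequently used template.
    victim = min(range(len(templates)), key=lambda i: templates[i]["uses"])
    templates[victim] = {"letter": letter, "code": segment, "uses": 0}
    return "replaced"

templates = [
    {"letter": "a", "code": "c1", "uses": 5},
    {"letter": "b", "code": "c2", "uses": 1},
]
first = learn(templates, "a", "c3", max_templates=3)   # room left: add
second = learn(templates, "e", "c4", max_templates=3)  # full, no gain: replace
```

Passing a positive `net_gain` together with an `adjust` function makes the policy stop at the second method instead of replacing a template.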

6 Experimental Results

The handwriting data were collected using the Wacom ArtZ II tablet (140 samples per
second and 100 lines per inch). The initial set of templates was collected from one
writer who did not participate in the testing. Test data were collected from four
writers. The user-independent system using the preprocessing and interpretation
modules had an average recognition rate of 65.5%, and the user-adaptive system using
all three modules reached 81.3%. Thus, the learning module improved the average
system accuracy by 15.8 percentage points.

We have conducted experiments to analyze the error distribution. Table 1 shows
the percentage of correct words appearing in different ranges using the
user-adaptive system. The table shows that the system always determines the correct
shape class. The screen process, which reduces the shape class to a small list of
candidates, causes an average error of 4%. The average performance for the top 1
choice is 81.3%. When the top three choices are considered, the average performance
improves to 91.7%. However, the average recognition rate for the top five choices
is 92.5%, which is only a small improvement over the top three choices.

Table 1. Recognition rates of the system on different criteria

Writer    Top 1    Top 3    Top 5    Candidate list    Shape class
A         84%      93%      93%      96%               100%
B         80%      91%      92%      97%               100%
C         75%      89%      91%      95%               100%
D         86%      94%      94%      97%               100%
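
As a quick consistency check, the averages quoted in the text can be recomputed from the per-writer entries of Table 1. The snippet below is a minimal sketch that assumes each writer carries equal weight (the text does not state how many words each writer contributed). Because the table entries are rounded to whole percentages, the recomputed values of 81.25, 91.75 and 92.5 agree with the reported 81.3%, 91.7% and 92.5% only to within rounding.

```python
# Per-writer recognition rates (%) copied from Table 1.
COLUMNS = ("Top 1", "Top 3", "Top 5", "Candidate list", "Shape class")
RATES = {
    "A": (84, 93, 93, 96, 100),
    "B": (80, 91, 92, 97, 100),
    "C": (75, 89, 91, 95, 100),
    "D": (86, 94, 94, 97, 100),
}


def column_averages(rates):
    """Unweighted mean of each column over all writers."""
    n = len(rates)
    return {
        col: sum(row[i] for row in rates.values()) / n
        for i, col in enumerate(COLUMNS)
    }


averages = column_averages(RATES)

# Error attributed to the screen process: correct words that do not
# survive into the candidate list.
screen_error = 100 - averages["Candidate list"]

for col in COLUMNS:
    print(f"{col:15s} {averages[col]:.2f}%")
print(f"Screen error    {screen_error:.2f}%")
```

The screen error computed this way is 3.75%, which matches the roughly 4% quoted above.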

7  Conclusions 

This  paper  has  presented  a  new  approach  for  on-line  handwriting  recognition.  The 
framework  of  our  approach  dispenses  with  elaborate  training  of  the  type  required  for 
statistical  pattern  recognition.  Initialization  of  the  system  consists  merely  in  providing 
four  samples  for  each  character,  written  in  isolation  by  one  writer.  The  dictionary 
words  need  not  be  provided  in  script  form.  Thus,  even  switching  to  a  new  vocabulary 
becomes very simple, requiring merely a new lexicon. While the principles underlying
the present approach are general enough, the techniques of segmentation and learning
are particularly well suited for Roman scripts. Tests have shown that the method is
robust  because  performance  does  not  degrade  significantly  even  when  words  written 
by  one  writer  are  interpreted  using  reference  characters  from  another. 

In  creating  a  complete  handwriting  interpretation  system,  one  must  decide  where 
effort can be most effectively applied to increase performance. In this system, the
effort has been concentrated on the interpretation module. The preprocessing module
could be improved, for example, by extracting a better set of features from the raw
pixel sequence.